Methods and Processes for Non-Invasive Assessment of Genetic Variations

ABSTRACT

Provided herein are methods, processes and apparatuses for non-invasive assessment of genetic variations that make use of nucleic acid fragments from circulating cell free nucleic acid. Also provided herein are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions according to one or more features.

RELATED PATENT APPLICATIONS

The present application is a continuation of U.S. Non-Provisional application Ser. No. 15/517,107, filed Apr. 5, 2017, entitled “Methods And Processes For Non-Invasive Assessment Of Genetic Variation,” which is a 35 U.S.C. 371 national phase patent application of PCT/US2015/054903, filed on Oct. 9, 2015, entitled “Methods And Processes For Non-Invasive Assessment Of Genetic Variation,” which claims the benefit of U.S. Provisional Patent Application No. 62/062,748 filed on Oct. 10, 2014, entitled “Methods And Processes For Non-Invasive Assessment Of Genetic Variation.” The entire content of the foregoing applications is incorporated herein by reference, including all text, tables and drawings, for all purposes.

FIELD

Technology provided herein relates in part to methods, processes and apparatuses for non-invasive assessment of genetic variations.

BACKGROUND

Genetic information of living organisms (e.g., animals, plants and microorganisms) and other forms of replicating genetic information (e.g., viruses) is encoded in deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Genetic information is a succession of nucleotides or modified nucleotides representing the primary structure of chemical or hypothetical nucleic acids. In humans, the complete genome contains about 30,000 genes located on twenty-four (24) chromosomes (see The Human Genome, T. Strachan, BIOS Scientific Publishers, 1992). Each gene encodes a specific protein, which after expression via transcription and translation fulfills a specific biochemical function within a living cell.

Many medical conditions are caused by one or more genetic variations. Certain genetic variations cause medical conditions that include, for example, hemophilia, thalassemia, Duchenne Muscular Dystrophy (DMD), Huntington's Disease (HD), Alzheimer's Disease and Cystic Fibrosis (CF) (Human Genome Mutations, D. N. Cooper and M. Krawczak, BIOS Publishers, 1993). Such genetic diseases can result from an addition, substitution, or deletion of a single nucleotide in DNA of a particular gene. Certain birth defects are caused by a chromosomal abnormality, also referred to as an aneuploidy, such as Trisomy 21 (Down's Syndrome), Trisomy 13 (Patau Syndrome), Trisomy 18 (Edward's Syndrome), Trisomy 16, Trisomy 22, Monosomy X (Turner's Syndrome) and certain sex chromosome aneuploidies such as Klinefelter's Syndrome (XXY), for example. Another genetic variation is fetal gender, which can often be determined based on sex chromosomes X and Y. Some genetic variations may predispose an individual to, or cause, any of a number of diseases such as, for example, diabetes, arteriosclerosis, obesity, various autoimmune diseases and cancer (e.g., colorectal, breast, ovarian, lung).

Identifying one or more genetic variations or variances can lead to diagnosis of, or determining predisposition to, a particular medical condition. Identifying a genetic variance can result in facilitating a medical decision and/or employing a helpful medical procedure. In certain embodiments, identification of one or more genetic variations or variances involves the analysis of cell-free DNA. Cell-free DNA (CF-DNA) is composed of DNA fragments that originate from cell death and circulate in peripheral blood. High concentrations of CF-DNA can be indicative of certain clinical conditions such as cancer, trauma, burns, myocardial infarction, stroke, sepsis, infection, and other illnesses. Additionally, cell-free fetal DNA (CFF-DNA) can be detected in the maternal bloodstream and used for various noninvasive prenatal diagnostics.

SUMMARY

Provided herein, in certain aspects, are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; and f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e).
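By way of non-limiting illustration, steps (a) through (f) can be sketched in Python as follows. The sketch rests on assumptions that are not part of the description above: coverage variability is taken to be the coefficient of variation of sampled coverage values, the comparison in (d) is taken against the least variable region, and the scaling rule in (e) is hypothetical. All function and variable names are illustrative only, and each region is assumed to have at least two coverage values with non-zero variability.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed variability measure)."""
    return statistics.stdev(values) / statistics.fmean(values)


def partition(start, end, portion_length):
    """Split the half-open interval [start, end) into consecutive portions."""
    return [(s, min(s + portion_length, end))
            for s in range(start, end, portion_length)]


def repartition(regions, coverage, initial_length):
    """regions: name -> (start, end); coverage: name -> list of coverage values.

    Returns name -> list of (start, end) portions.
    """
    # (a) coverage variability for each region
    cv = {name: coefficient_of_variation(coverage[name]) for name in regions}
    # (d) compare every region with the least variable region
    reference_cv = min(cv.values())
    result = {}
    for name, (start, end) in regions.items():
        # (b), (c) the initial partition gives the initial number of portions
        initial_count = len(partition(start, end, initial_length))
        # (e) hypothetical rule: the number of portions is scaled down in
        # proportion to relative variability, so noisier regions get longer portions
        ratio = cv[name] / reference_cv
        count = max(1, round(initial_count / ratio))
        optimized_length = math.ceil((end - start) / count)
        # (f) re-partition with the optimized portion length
        result[name] = partition(start, end, optimized_length)
    return result


regions = {"A": (0, 1000), "B": (0, 1000)}
coverage = {"A": [10, 11, 9, 10], "B": [10, 14, 6, 10]}
portions = repartition(regions, coverage, initial_length=125)
```

In this toy input, region B is four times as variable as region A, so under the assumed rule it is re-partitioned into fewer, longer portions while region A keeps the initial portion length.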

Also provided herein, in certain aspects, are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region; g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus; h) determining a minimum genomic region size; and i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

Also provided herein, in certain aspects, are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region; g) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; h) determining a local minimum genomic region size; and i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

Also provided herein, in certain aspects, are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region; g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus; h) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; i) determining a local minimum genomic region size; and j) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

Also provided herein, in certain aspects, are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; e) determining a local minimum genomic region size; and f) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a re-partitioned genomic region.

Also provided herein, in certain aspects, are methods for partitioning a reference genome, or part thereof, into a plurality of portions comprising: a) generating a guanine and cytosine (GC) profile for a reference genome, or part thereof; b) applying a segmenting process to the GC profile generated in (a), thereby providing discrete segments; and c) partitioning the reference genome, or part thereof, into a plurality of portions according to the discrete segments provided in (b), thereby generating a GC partitioned reference genome, or part thereof.
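By way of non-limiting illustration, steps (a) through (c) of the GC-based partitioning can be sketched in Python as follows. The segmenting process used here is a simple greedy rule that closes a segment when a window's GC fraction departs from the running segment mean by more than a tolerance; it is a stand-in chosen for brevity and is not the wavelet binning method referred to in the drawings. The window size, the tolerance and all names are assumptions made for the example, and the input sequence is assumed to be non-empty.

```python
def gc_profile(sequence, window):
    """(a) GC fraction for consecutive non-overlapping windows of the sequence."""
    profile = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc = sum(base in "GC" for base in chunk)
        profile.append(gc / len(chunk))
    return profile


def segment_profile(profile, tolerance):
    """(b) Group adjacent windows into segments of similar GC content.

    Returns (first_window, last_window_exclusive) index pairs.
    """
    segments = []
    seg_start = 0
    seg_sum = profile[0]
    for i in range(1, len(profile)):
        seg_mean = seg_sum / (i - seg_start)
        if abs(profile[i] - seg_mean) > tolerance:
            # GC content changed: close the current segment and start a new one
            segments.append((seg_start, i))
            seg_start = i
            seg_sum = profile[i]
        else:
            seg_sum += profile[i]
    segments.append((seg_start, len(profile)))
    return segments


def gc_partition(sequence, window, tolerance):
    """(c) Convert window-index segments into (start, end) portions in bases."""
    profile = gc_profile(sequence, window)
    return [(a * window, min(b * window, len(sequence)))
            for a, b in segment_profile(profile, tolerance)]


toy_sequence = "AT" * 10 + "GC" * 10 + "AT" * 5
toy_portions = gc_partition(toy_sequence, window=10, tolerance=0.2)
```

For the toy sequence, the portions follow the GC-poor, GC-rich and GC-poor stretches, so portion lengths vary with sequence content rather than being fixed in advance.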

Certain embodiments are described further in the following description, examples, claims and drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings illustrate certain embodiments of the technology and are not limiting. For clarity and ease of illustration, the drawings are not made to scale and, in some instances, various aspects may be shown exaggerated or enlarged to facilitate an understanding of particular embodiments.

FIG. 1 shows a genomic region partitioned into segments of similar GC content using a wavelet binning method.

FIG. 2 shows an example of bin size distribution using a wavelet binning method.

FIG. 3 shows classification results for chromosomes 21, 18 and 13 using a wavelet binning method and a 50 kb binning method on the LDTv4CE2 study. The accuracy is identical for the two methods.

FIG. 4 shows truth tables for chromosomes 21, 18 and 13 for the LDTv4CE2 study.

FIG. 5 shows an example workflow using a wavelet binning method.

FIG. 6 shows z-score distributions for euploid events (left peak) and trisomic events (right peak). The shaded area represents false negative (FN) rate a.

FIG. 7 shows minimum fetal fraction for microdeletion and/or microduplication detection to achieve a population level false negative (FN) rate of 1%.

FIG. 8 presents a list of certain examples of microdeletions and microduplications.

FIG. 9 shows a LOESS regression plot of normalized counts vs. enet bin coefficients (excluding bins with 0 coefficient) for three samples. Sample 1 had low fetal fraction (about 5%), sample 2 had medium fetal fraction (about 10%), and sample 3 had high fetal fraction (about 20%).

FIG. 10 shows certain steps that can be used in an optimal discretization method.

FIG. 11 shows example workflows for an optimal discretization method using certain steps presented in FIG. 10.

FIG. 12 shows an illustrative embodiment of a system in which certain embodiments of the technology may be implemented.

DETAILED DESCRIPTION

Provided herein are methods for determining the presence or absence of a genetic variation (e.g., a chromosome aneuploidy, microduplication or microdeletion) where a determination is made, in part and/or in full, according to nucleic acid sequences. Also provided herein are methods for partitioning one or more genomic regions of a reference genome into a plurality of portions according to sequencing coverage variability and/or sequence content (e.g., guanine and cytosine (GC) content). In some embodiments nucleic acid sequences are obtained from a sample obtained from a pregnant female (e.g., from the blood of a pregnant female). Also provided herein are improved data manipulation methods as well as systems, apparatuses and modules that, in some embodiments, carry out the methods described herein. In some embodiments, identifying a genetic variation by a method described herein can lead to a diagnosis of, or determine a predisposition to, a particular medical condition. Identifying a genetic variance can result in facilitating a medical decision and/or employing a helpful medical procedure.

Samples

Provided herein are methods and compositions for analyzing nucleic acid. In some embodiments, nucleic acid fragments in a mixture of nucleic acid fragments are analyzed. A mixture of nucleic acids can comprise two or more nucleic acid fragment species having different nucleotide sequences, different fragment lengths, different origins (e.g., genomic origins, fetal vs. maternal origins, cell or tissue origins, sample origins, subject origins, and the like), or combinations thereof.

Nucleic acid or a nucleic acid mixture utilized in methods and apparatuses described herein often is isolated from a sample obtained from a subject. A subject can be any living or non-living organism, including but not limited to a human, a non-human animal, a plant, a bacterium, a fungus or a protist. Any human or non-human animal can be selected, including but not limited to mammal, reptile, avian, amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g., horse), caprine and ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama, alpaca), monkey, ape (e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse, rat, fish, dolphin, whale and shark. A subject may be a male or female (e.g., woman, a pregnant woman). A subject may be any age (e.g., an embryo, a fetus, infant, child, adult).

Nucleic acid may be isolated from any type of suitable biological specimen or sample (e.g., a test sample). A sample or test sample can be any specimen that is isolated or obtained from a subject or part thereof (e.g., a human subject, a pregnant female, a fetus). Non-limiting examples of specimens include fluid or tissue from a subject, including, without limitation, blood or a blood product (e.g., serum, plasma, or the like), umbilical cord blood, chorionic villi, amniotic fluid, cerebrospinal fluid, spinal fluid, lavage fluid (e.g., bronchoalveolar, gastric, peritoneal, ductal, ear, arthroscopic), biopsy sample (e.g., from pre-implantation embryo), celocentesis sample, cells (blood cells, placental cells, embryo or fetal cells, fetal nucleated cells or fetal cellular remnants) or parts thereof (e.g., mitochondrial, nucleus, extracts, or the like), washings of female reproductive tract, urine, feces, sputum, saliva, nasal mucous, prostate fluid, lavage, semen, lymphatic fluid, bile, tears, sweat, breast milk, breast fluid, the like or combinations thereof. In some embodiments, a biological sample is a cervical swab from a subject. In some embodiments, a biological sample may be blood and sometimes plasma or serum. The term “blood” as used herein refers to a blood sample or preparation from a pregnant woman or a woman being tested for possible pregnancy. The term encompasses whole blood, blood product or any fraction of blood, such as serum, plasma, buffy coat, or the like as conventionally defined. Blood or fractions thereof often comprise nucleosomes (e.g., maternal and/or fetal nucleosomes). Nucleosomes comprise nucleic acids and are sometimes cell-free or intracellular. Blood also comprises buffy coats. Buffy coats are sometimes isolated by utilizing a ficoll gradient. Buffy coats can comprise white blood cells (e.g., leukocytes, T-cells, B-cells, platelets, and the like). In certain embodiments buffy coats comprise maternal and/or fetal nucleic acid.
Blood plasma refers to the fraction of whole blood resulting from centrifugation of blood treated with anticoagulants. Blood serum refers to the watery portion of fluid remaining after a blood sample has coagulated. Fluid or tissue samples often are collected in accordance with standard protocols hospitals or clinics generally follow. For blood, an appropriate amount of peripheral blood (e.g., between 3-40 milliliters) often is collected and can be stored according to standard procedures prior to or after preparation. A fluid or tissue sample from which nucleic acid is extracted may be acellular (e.g., cell-free). In some embodiments, a fluid or tissue sample may contain cellular elements or cellular remnants. In some embodiments fetal cells or cancer cells may be included in the sample.

A sample often is heterogeneous, by which is meant that more than one type of nucleic acid species is present in the sample. For example, heterogeneous nucleic acid can include, but is not limited to, (i) fetal derived and maternal derived nucleic acid, (ii) cancer and non-cancer nucleic acid, (iii) pathogen and host nucleic acid, and more generally, (iv) mutated and wild-type nucleic acid. A sample may be heterogeneous because more than one cell type is present, such as a fetal cell and a maternal cell, a cancer and non-cancer cell, or a pathogenic and host cell. In some embodiments, a minority nucleic acid species and a majority nucleic acid species are present.

For prenatal applications of technology described herein, a fluid or tissue sample may be collected from a female at a gestational age suitable for testing, or from a female who is being tested for possible pregnancy. Suitable gestational age may vary depending on the prenatal test being performed. In certain embodiments, a pregnant female subject sometimes is in the first trimester of pregnancy, at times in the second trimester of pregnancy, or sometimes in the third trimester of pregnancy. In certain embodiments, a fluid or tissue is collected from a pregnant female between about 1 to about 45 weeks of fetal gestation (e.g., at 1-4, 4-8, 8-12, 12-16, 16-20, 20-24, 24-28, 28-32, 32-36, 36-40 or 40-44 weeks of fetal gestation), and sometimes between about 5 to about 28 weeks of fetal gestation (e.g., at 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26 or 27 weeks of fetal gestation). In certain embodiments a fluid or tissue sample is collected from a pregnant female during or just after (e.g., 0 to 72 hours after) giving birth (e.g., vaginal or non-vaginal birth (e.g., surgical delivery)).

Acquisition of Blood Samples and Extraction of DNA

Methods herein often include separating, enriching and analyzing fetal DNA found in maternal blood as a non-invasive means to detect the presence or absence of a maternal and/or fetal genetic variation and/or to monitor the health of a fetus and/or a pregnant female during and sometimes after pregnancy. Thus, the first steps of practicing certain methods herein often include obtaining a blood sample from a pregnant woman and extracting DNA from a sample.

Acquisition of Blood Samples

A blood sample can be obtained from a pregnant woman at a gestational age suitable for testing using a method of the present technology. A suitable gestational age may vary depending on the disorder tested, as discussed below. Collection of blood from a woman often is performed in accordance with the standard protocol hospitals or clinics generally follow. An appropriate amount of peripheral blood, e.g., typically between 5-50 ml, often is collected and may be stored according to standard procedure prior to further preparation. Blood samples may be collected, stored or transported in a manner that minimizes degradation or preserves the quality of nucleic acid present in the sample.

Preparation of Blood Samples

An analysis of fetal DNA found in maternal blood may be performed using, e.g., whole blood, serum, or plasma. Methods for preparing serum or plasma from maternal blood are known. For example, a pregnant woman's blood can be placed in a tube containing EDTA or a specialized commercial product such as Vacutainer SST (Becton Dickinson, Franklin Lakes, N.J.) to prevent blood clotting, and plasma can then be obtained from whole blood through centrifugation. Serum may be obtained with or without centrifugation following blood clotting. If centrifugation is used then it is typically, though not exclusively, conducted at an appropriate speed, e.g., 1,500-3,000 times g. Plasma or serum may be subjected to additional centrifugation steps before being transferred to a fresh tube for DNA extraction.

In addition to the acellular portion of the whole blood, DNA may also be recovered from the cellular fraction, enriched in the buffy coat portion, which can be obtained following centrifugation of a whole blood sample from the woman and removal of the plasma.

Extraction of DNA

There are numerous known methods for extracting DNA from a biological sample including blood. The general methods of DNA preparation (e.g., described by Sambrook and Russell, Molecular Cloning: A Laboratory Manual 3d ed., 2001) can be followed; various commercially available reagents or kits, such as Qiagen's QIAamp Circulating Nucleic Acid Kit, QiaAmp DNA Mini Kit or QiaAmp DNA Blood Mini Kit (Qiagen, Hilden, Germany), GenomicPrep™ Blood DNA Isolation Kit (Promega, Madison, Wis.), and GFX™ Genomic Blood DNA Purification Kit (Amersham, Piscataway, N.J.), may also be used to obtain DNA from a blood sample from a pregnant woman. Combinations of more than one of these methods may also be used.

In some embodiments, the sample may first be enriched or relatively enriched for fetal nucleic acid by one or more methods. For example, the discrimination of fetal and maternal DNA can be performed using the compositions and processes of the present technology alone or in combination with other discriminating factors. Examples of these factors include, but are not limited to, single nucleotide differences between chromosome X and Y, chromosome Y-specific sequences, polymorphisms located elsewhere in the genome, size differences between fetal and maternal DNA and differences in methylation pattern between maternal and fetal tissues.

Other methods for enriching a sample for a particular species of nucleic acid are described in PCT Patent Application Number PCT/US07/69991, filed May 30, 2007, PCT Patent Application Number PCT/US2007/071232, filed Jun. 15, 2007, U.S. Provisional Application Nos. 60/968,876 and 60/968,878 (assigned to the Applicant), and PCT Patent Application Number PCT/EP05/012707, filed Nov. 28, 2005, which are all hereby incorporated by reference. In certain embodiments, maternal nucleic acid is selectively removed (either partially, substantially, almost completely or completely) from the sample.

The terms “nucleic acid” and “nucleic acid molecule” may be used interchangeably throughout the disclosure. The terms refer to nucleic acids of any composition form, such as DNA (e.g., complementary DNA (cDNA), genomic DNA (gDNA) and the like), RNA (e.g., messenger RNA (mRNA), short inhibitory RNA (siRNA), ribosomal RNA (rRNA), tRNA, microRNA, RNA highly expressed by the fetus or placenta, and the like), and/or DNA or RNA analogs (e.g., containing base analogs, sugar analogs and/or a non-native backbone and the like), RNA/DNA hybrids and polyamide nucleic acids (PNAs), all of which can be in single- or double-stranded form, and unless otherwise limited, can encompass known analogs of natural nucleotides that can function in a similar manner as naturally occurring nucleotides. A nucleic acid may be, or may be from, a plasmid, phage, autonomously replicating sequence (ARS), centromere, artificial chromosome, chromosome, or other nucleic acid able to replicate or be replicated in vitro or in a host cell, a cell, a cell nucleus or cytoplasm of a cell in certain embodiments. A template nucleic acid in some embodiments can be from a single chromosome (e.g., a nucleic acid sample may be from one chromosome of a sample obtained from a diploid organism). Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions), alleles, orthologs, single nucleotide polymorphisms (SNPs), and complementary sequences as well as the sequence explicitly indicated.
Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues. The term nucleic acid is used interchangeably with locus, gene, cDNA, and mRNA encoded by a gene. The term also may include, as equivalents, derivatives, variants and analogs of RNA or DNA synthesized from nucleotide analogs, single-stranded (“sense” or “antisense”, “plus” strand or “minus” strand, “forward” reading frame or “reverse” reading frame) and double-stranded polynucleotides. The term “gene” means the segment of DNA involved in producing a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) involved in the transcription/translation of the gene product and the regulation of the transcription/translation, as well as intervening sequences (introns) between individual coding segments (exons). Deoxyribonucleotides include deoxyadenosine, deoxycytidine, deoxyguanosine and deoxythymidine. For RNA, the base thymine is replaced with uracil. A template nucleic acid may be prepared using a nucleic acid obtained from a subject as a template.

Nucleic Acid Isolation and Processing

Nucleic acid may be derived from one or more sources (e.g., cells, serum, plasma, buffy coat, lymphatic fluid, skin, soil, and the like) by methods known in the art. Any suitable method can be used for isolating, extracting and/or purifying DNA from a biological sample (e.g., from blood or a blood product), non-limiting examples of which include methods of DNA preparation (e.g., described by Sambrook and Russell, Molecular Cloning: A Laboratory Manual 3d ed., 2001), various commercially available reagents or kits, such as Qiagen's QIAamp Circulating Nucleic Acid Kit, QiaAmp DNA Mini Kit or QiaAmp DNA Blood Mini Kit (Qiagen, Hilden, Germany), GenomicPrep™ Blood DNA Isolation Kit (Promega, Madison, Wis.), and GFX™ Genomic Blood DNA Purification Kit (Amersham, Piscataway, N.J.), the like or combinations thereof.

Cell lysis procedures and reagents are known in the art and may generally be performed by chemical (e.g., detergent, hypotonic solutions, enzymatic procedures, and the like, or combination thereof), physical (e.g., French press, sonication, and the like), or electrolytic lysis methods. Any suitable lysis procedure can be utilized. For example, chemical methods generally employ lysing agents to disrupt cells and extract the nucleic acids from the cells, followed by treatment with chaotropic salts. Physical methods such as freeze/thaw followed by grinding, the use of cell presses and the like also are useful. High salt lysis procedures also are commonly used. For example, an alkaline lysis procedure may be utilized. The latter procedure traditionally incorporates the use of phenol-chloroform solutions, and an alternative phenol-chloroform-free procedure involving three solutions can be utilized. In the latter procedures, one solution can contain 15 mM Tris, pH 8.0; 10 mM EDTA and 100 μg/ml RNase A; a second solution can contain 0.2N NaOH and 1% SDS; and a third solution can contain 3M KOAc, pH 5.5. These procedures can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y., 6.3.1-6.3.6 (1989), incorporated herein in its entirety.

Nucleic acid may be isolated at a different time point as compared to another nucleic acid, where each of the samples is from the same or a different source. A nucleic acid may be from a nucleic acid library, such as a cDNA or RNA library, for example. A nucleic acid may be a result of nucleic acid purification or isolation and/or amplification of nucleic acid molecules from the sample. Nucleic acid provided for processes described herein may contain nucleic acid from one sample or from two or more samples (e.g., from 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 11 or more, 12 or more, 13 or more, 14 or more, 15 or more, 16 or more, 17 or more, 18 or more, 19 or more, or 20 or more samples).

Nucleic acids can include extracellular nucleic acid in certain embodiments. The term “extracellular nucleic acid” as used herein can refer to nucleic acid isolated from a source having substantially no cells and also is referred to as “cell-free” nucleic acid and/or “cell-free circulating” nucleic acid. Extracellular nucleic acid can be present in and obtained from blood (e.g., from the blood of a pregnant female). Extracellular nucleic acid often includes no detectable cells and may contain cellular elements or cellular remnants. Non-limiting examples of acellular sources for extracellular nucleic acid are blood, blood plasma, blood serum and urine. As used herein, the term “obtain cell-free circulating sample nucleic acid” includes obtaining a sample directly (e.g., collecting a sample, e.g., a test sample) or obtaining a sample from another who has collected a sample. Without being limited by theory, extracellular nucleic acid may be a product of cell apoptosis and cell breakdown, which provides basis for extracellular nucleic acid often having a series of lengths across a spectrum (e.g., a “ladder”).

Extracellular nucleic acid can include different nucleic acid species, and therefore is referred to herein as “heterogeneous” in certain embodiments. For example, blood serum or plasma from a person having cancer can include nucleic acid from cancer cells and nucleic acid from non-cancer cells. In another example, blood serum or plasma from a pregnant female can include maternal nucleic acid and fetal nucleic acid. In some instances, fetal nucleic acid sometimes is about 5% to about 50% of the overall nucleic acid (e.g., about 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, or 49% of the total nucleic acid is fetal nucleic acid). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 500 base pairs or less, about 250 base pairs or less, about 200 base pairs or less, about 150 base pairs or less, about 100 base pairs or less, about 50 base pairs or less or about 25 base pairs or less. In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 500 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 500 base pairs or less). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 250 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 250 base pairs or less). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 200 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 200 base pairs or less).
In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 150 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 150 base pairs or less). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 100 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 100 base pairs or less). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 50 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 50 base pairs or less). In some embodiments, the majority of fetal nucleic acid in nucleic acid is of a length of about 25 base pairs or less (e.g., about 80, 85, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99 or 100% of fetal nucleic acid is of a length of about 25 base pairs or less).

In some embodiments, nucleic acid fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are analyzed. In some embodiments, fragments having a length under a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “short” fragments and fragments having a length over a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “long” fragments. For example, fragments having a length under 200 bp are referred to as “short” fragments and fragments having a length equal to or over 200 bp are referred to as “long” fragments. In some embodiments, fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are analyzed while fragments of a different length or range of lengths, or lengths over or under the threshold or cutoff are not analyzed. In some embodiments, fragments of a certain length, range of lengths, or lengths under a particular threshold or cutoff are analyzed separately from fragments of a different length or range of lengths, or lengths over the threshold or cutoff.

In some embodiments, fragments that are less than about 500 bp are analyzed. In some embodiments, fragments that are less than about 400 bp are analyzed. In some embodiments, fragments that are less than about 300 bp are analyzed. In some embodiments, fragments that are less than about 200 bp are analyzed. In some embodiments, fragments that are less than about 150 bp are analyzed. For example, fragments that are less than about 200 bp, 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp, 110 bp or 100 bp are analyzed. In some embodiments, fragments that are about 100 bp to about 200 bp are analyzed. For example, fragments that are about 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp or 110 bp are analyzed. In some embodiments, fragments that are in the range of about 100 bp to about 200 bp are analyzed. For example, fragments that are in the range of about 110 bp to about 190 bp, 130 bp to about 180 bp, 140 bp to about 170 bp, 140 bp to about 150 bp, 150 bp to about 160 bp, 145 bp to about 155 bp, or 130 bp to 140 bp are analyzed. In some embodiments, fragments that are about 135 bp are analyzed. In some embodiments, fragments that are about 200 bp or greater are analyzed. In some embodiments, fragments that are about 200 bp are analyzed. In some embodiments, fragments that are about 10 bp to about 30 bp shorter than other fragments of a certain length or range of lengths are analyzed. In some embodiments, fragments that are about 10 bp to about 20 bp shorter than other fragments of a certain length or range of lengths are analyzed. In some embodiments, fragments that are about 10 bp to about 15 bp shorter than other fragments of a certain length or range of lengths are analyzed.

Nucleic acid may be provided for conducting methods described hereinwithout processing of the sample(s) containing the nucleic acid, incertain embodiments. In some embodiments, nucleic acid is provided forconducting methods described herein after processing of the sample(s)containing the nucleic acid. For example, a nucleic acid can beextracted, isolated, purified, partially purified or amplified from thesample(s). The term “isolated” as used herein refers to nucleic acidremoved from its original environment (e.g., the natural environment ifit is naturally occurring, or a host cell if expressed exogenously), andthus is altered by human intervention (e.g., “by the hand of man”) fromits original environment. The term “isolated nucleic acid” as usedherein can refer to a nucleic acid removed from a subject (e.g., a humansubject). An isolated nucleic acid can be provided with fewernon-nucleic acid components (e.g., protein, lipid) than the amount ofcomponents present in a source sample. A composition comprising isolatednucleic acid can be about 50% to greater than 99% free of non-nucleicacid components. A composition comprising isolated nucleic acid can beabout 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or greater than99% free of non-nucleic acid components. The term “purified” as usedherein can refer to a nucleic acid provided that contains fewernon-nucleic acid components (e.g., protein, lipid, carbohydrate) thanthe amount of non-nucleic acid components present prior to subjectingthe nucleic acid to a purification procedure. A composition comprisingpurified nucleic acid may be about 80%, 81%, 82%, 83%, 84%, 85%, 86%,87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% orgreater than 99% free of other non-nucleic acid components. The term“purified” as used herein can refer to a nucleic acid provided thatcontains fewer nucleic acid species than in the sample source from whichthe nucleic acid is derived. 
A composition comprising purified nucleic acid may be about 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or greater than 99% free of other nucleic acid species. For example, fetal nucleic acid can be purified from a mixture comprising maternal and fetal nucleic acid. In certain examples, nucleosomes comprising small fragments of fetal nucleic acid can be purified from a mixture of larger nucleosome complexes comprising larger fragments of maternal nucleic acid.

In some embodiments nucleic acids are fragmented or cleaved prior to, during or after a method described herein. Fragmented or cleaved nucleic acid may have a nominal, average or mean length of about 5 to about 10,000 base pairs, about 100 to about 1,000 base pairs, about 100 to about 500 base pairs, or about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000 or 9000 base pairs. Fragments can be generated by a suitable method known in the art, and the average, mean or nominal length of nucleic acid fragments can be controlled by selecting an appropriate fragment-generating procedure.

Nucleic acid fragments may contain overlapping nucleotide sequences, and such overlapping sequences can facilitate construction of a nucleotide sequence of the non-fragmented counterpart nucleic acid, or a segment thereof. For example, one fragment may have subsequences x and y and another fragment may have subsequences y and z, where x, y and z are nucleotide sequences that can be 5 nucleotides in length or greater. Overlap sequence y can be utilized to facilitate construction of the x-y-z nucleotide sequence in nucleic acid from a sample in certain embodiments. Nucleic acid may be partially fragmented (e.g., from an incomplete or terminated specific cleavage reaction) or fully fragmented in certain embodiments.
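As a minimal sketch of how an overlap sequence y can join an x-y fragment to a y-z fragment, the following Python example merges two fragment sequences on their longest suffix/prefix overlap. The function name merge_on_overlap, the example sequences and the exact-match requirement are illustrative assumptions and are not part of the methods described herein; the 5-nucleotide minimum mirrors the length noted above.

```python
from typing import Optional


def merge_on_overlap(left: str, right: str, min_overlap: int = 5) -> Optional[str]:
    """Join two fragments on the longest exact suffix/prefix overlap.

    Returns the merged sequence, or None when no overlap of at least
    min_overlap nucleotides exists. Exact matching is an assumption made
    for illustration; real reads would need mismatch tolerance.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first so the merge is maximal.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    return None


# Hypothetical subsequences x, y and z (each 8 nucleotides long).
x, y, z = "ACGTTGCA", "GGATCCAT", "TTAGCCGA"
fragment_xy = x + y
fragment_yz = y + z

merged = merge_on_overlap(fragment_xy, fragment_yz)
print(merged)
```

In this example the shared subsequence y is found as the overlap and the x-y-z sequence is reconstructed; fragments with no sufficient overlap return None rather than being joined arbitrarily.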

In some embodiments nucleic acid is fragmented or cleaved by a suitable method, non-limiting examples of which include physical methods (e.g., shearing, sonication, French press, heat, UV irradiation and the like), enzymatic processes (e.g., enzymatic cleavage agents such as a suitable nuclease, a suitable restriction enzyme or a suitable methylation sensitive restriction enzyme), chemical methods (e.g., alkylation, DMS, piperidine, acid hydrolysis, base hydrolysis, heat, the like, or combinations thereof), processes described in U.S. Patent Application Publication No. 20050112590, the like or combinations thereof.

As used herein, “fragmentation” or “cleavage” refers to a procedure or conditions in which a nucleic acid molecule, such as a nucleic acid template gene molecule or amplified product thereof, may be severed into two or more smaller nucleic acid molecules. Such fragmentation or cleavage can be sequence specific, base specific, or nonspecific, and can be accomplished by any of a variety of methods, reagents or conditions, including, for example, chemical, enzymatic or physical fragmentation.

As used herein, “fragments”, “cleavage products”, “cleaved products” orgrammatical variants thereof, refers to nucleic acid molecules resultantfrom a fragmentation or cleavage of a nucleic acid template genemolecule or amplified product thereof. While such fragments or cleavedproducts can refer to all nucleic acid molecules resultant from acleavage reaction, typically such fragments or cleaved products referonly to nucleic acid molecules resultant from a fragmentation orcleavage of a nucleic acid template gene molecule or the segment of anamplified product thereof containing the corresponding nucleotidesequence of a nucleic acid template gene molecule. The term “amplified”as used herein refers to subjecting a target nucleic acid in a sample toa process that linearly or exponentially generates amplicon nucleicacids having the same or substantially the same nucleotide sequence asthe target nucleic acid, or segment thereof. In certain embodiments theterm “amplified” refers to a method that comprises a polymerase chainreaction (PCR). For example, an amplified product can contain one ormore nucleotides more than the amplified nucleotide region of a nucleicacid template sequence (e.g., a primer can contain “extra” nucleotidessuch as a transcriptional initiation sequence, in addition tonucleotides complementary to a nucleic acid template gene molecule,resulting in an amplified product containing “extra” nucleotides ornucleotides not corresponding to the amplified nucleotide region of thenucleic acid template gene molecule). Accordingly, fragments can includefragments arising from segments or parts of amplified nucleic acidmolecules containing, at least in part, nucleotide sequence informationfrom or based on the representative nucleic acid template molecule.

As used herein, the term “complementary cleavage reactions” refers to cleavage reactions that are carried out on the same nucleic acid using different cleavage reagents or by altering the cleavage specificity of the same cleavage reagent such that alternate cleavage patterns of the same target or reference nucleic acid or protein are generated. In certain embodiments, nucleic acid may be treated with one or more specific cleavage agents (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more specific cleavage agents) in one or more reaction vessels (e.g., nucleic acid is treated with each specific cleavage agent in a separate vessel). The term “specific cleavage agent” as used herein refers to an agent, sometimes a chemical or an enzyme, that can cleave a nucleic acid at one or more specific sites.

Nucleic acid also may be exposed to a process that modifies certain nucleotides in the nucleic acid before providing nucleic acid for a method described herein. A process that selectively modifies nucleic acid based upon the methylation state of nucleotides therein can be applied to nucleic acid, for example. In addition, conditions such as high temperature, ultraviolet radiation and x-radiation can induce changes in the sequence of a nucleic acid molecule. Nucleic acid may be provided in any suitable form useful for conducting a suitable sequence analysis.

Nucleic acid may be single or double stranded. Single stranded DNA, for example, can be generated by denaturing double stranded DNA by heating or by treatment with alkali.

In certain embodiments, nucleic acid is in a D-loop structure, formed by strand invasion of a duplex DNA molecule by an oligonucleotide or a DNA-like molecule such as peptide nucleic acid (PNA). D-loop formation can be facilitated by addition of E. coli RecA protein and/or by alteration of salt concentration, for example, using methods known in the art.

Determining Fetal Nucleic Acid Content

The amount of fetal nucleic acid (e.g., concentration, relative amount,absolute amount, copy number, and the like) in nucleic acid isdetermined in some embodiments. In certain embodiments, the amount offetal nucleic acid in a sample is referred to as “fetal fraction”. Insome embodiments “fetal fraction” refers to the fraction of fetalnucleic acid in circulating cell-free nucleic acid in a sample (e.g., ablood sample, a serum sample, a plasma sample) obtained from a pregnantfemale. In certain embodiments, the amount of fetal nucleic acid isdetermined according to markers specific to a male fetus (e.g.,Y-chromosome STR markers (e.g., DYS 19, DYS 385, DYS 392 markers); RhDmarker in RhD-negative females), allelic ratios of polymorphicsequences, or according to one or more markers specific to fetal nucleicacid and not maternal nucleic acid (e.g., differential epigeneticbiomarkers (e.g., methylation; described in further detail below)between mother and fetus, or fetal RNA markers in maternal blood plasma(see e.g., Lo, 2005, Journal of Histochemistry and Cytochemistry 53 (3):293-296)).

Determination of fetal nucleic acid content (e.g., fetal fraction)sometimes is performed using a fetal quantifier assay (FQA) asdescribed, for example, in U.S. Patent Application Publication No.2010/0105049, which is hereby incorporated by reference. This type ofassay allows for the detection and quantification of fetal nucleic acidin a maternal sample based on the methylation status of the nucleic acidin the sample. In certain embodiments, the amount of fetal nucleic acidfrom a maternal sample can be determined relative to the total amount ofnucleic acid present, thereby providing the percentage of fetal nucleicacid in the sample. In certain embodiments, the copy number of fetalnucleic acid can be determined in a maternal sample. In certainembodiments, the amount of fetal nucleic acid can be determined in asequence-specific (or portion-specific) manner and sometimes withsufficient sensitivity to allow for accurate chromosomal dosage analysis(for example, to detect the presence or absence of a fetal aneuploidy,microduplication or microdeletion).

A fetal quantifier assay (FQA) can be performed in conjunction with anyof the methods described herein. Such an assay can be performed by anymethod known in the art and/or described in U.S. Patent ApplicationPublication No. 2010/0105049, such as, for example, by a method that candistinguish between maternal and fetal DNA based on differentialmethylation status, and quantify (i.e. determine the amount of) thefetal DNA. Methods for differentiating nucleic acid based on methylationstatus include, but are not limited to, methylation sensitive capture,for example, using a MBD2-Fc fragment in which the methyl binding domainof MBD2 is fused to the Fc fragment of an antibody (MBD-FC) (Gebhard etal. (2006) Cancer Res. 66(12):6118-28); methylation specific antibodies;bisulfite conversion methods, for example, MSP (methylation-sensitivePCR), COBRA, methylation-sensitive single nucleotide primer extension(Ms-SNuPE) or Sequenom MassCLEAVE™ technology; and the use ofmethylation sensitive restriction enzymes (e.g., digestion of maternalDNA in a maternal sample using one or more methylation sensitiverestriction enzymes thereby enriching the fetal DNA). Methyl-sensitiveenzymes also can be used to differentiate nucleic acid based onmethylation status, which, for example, can preferentially orsubstantially cleave or digest at their DNA recognition sequence if thelatter is non-methylated. Thus, an unmethylated DNA sample will be cutinto smaller fragments than a methylated DNA sample and ahypermethylated DNA sample will not be cleaved. Except where explicitlystated, any method for differentiating nucleic acid based on methylationstatus can be used with the compositions and methods of the technologyherein. The amount of fetal DNA can be determined, for example, byintroducing one or more competitors at known concentrations during anamplification reaction. Determining the amount of fetal DNA also can bedone, for example, by RT-PCR, primer extension, sequencing and/orcounting. 
In certain instances, the amount of nucleic acid can be determined using BEAMing technology as described in U.S. Patent Application Publication No. 2007/0065823. In certain embodiments, the restriction efficiency can be determined and the efficiency rate is used to further determine the amount of fetal DNA.

In certain embodiments, a fetal quantifier assay (FQA) can be used to determine the concentration of fetal DNA in a maternal sample, for example, by the following method: a) determine the total amount of DNA present in a maternal sample; b) selectively digest the maternal DNA in a maternal sample using one or more methylation sensitive restriction enzymes thereby enriching the fetal DNA; c) determine the amount of fetal DNA from step b); and d) compare the amount of fetal DNA from step c) to the total amount of DNA from step a), thereby determining the concentration of fetal DNA in the maternal sample. In certain embodiments, the absolute copy number of fetal nucleic acid in a maternal sample can be determined, for example, using mass spectrometry and/or a system that uses a competitive PCR approach for absolute copy number measurements. See for example, Ding and Cantor (2003) Proc. Natl. Acad. Sci. USA 100:3059-3064, and U.S. Patent Application Publication No. 2004/0081993, both of which are hereby incorporated by reference.
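The comparison in step d) above reduces to a ratio of two measured amounts. The following Python sketch illustrates that arithmetic only; the function name fetal_concentration and the copy-number values are hypothetical, and the wet-lab measurements of steps a) through c) are assumed to have been made already and expressed in the same units.

```python
def fetal_concentration(total_copies: float, fetal_copies: float) -> float:
    """Step d): fraction of fetal DNA relative to total DNA in the sample.

    total_copies -- amount of DNA measured in step a)
    fetal_copies -- amount of fetal DNA measured in step c)
    Both inputs are assumed to be in the same units (e.g., copies per mL).
    """
    if total_copies <= 0:
        raise ValueError("total amount of DNA must be positive")
    if not 0 <= fetal_copies <= total_copies:
        raise ValueError("fetal amount must lie between 0 and the total amount")
    return fetal_copies / total_copies


# Hypothetical measurements for one maternal sample.
fraction = fetal_concentration(total_copies=2000.0, fetal_copies=240.0)
print(f"fetal DNA concentration: {fraction:.1%}")
```

The sketch does not model restriction efficiency; where an efficiency rate is determined, the fetal amount would be corrected before the ratio is taken.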

In certain embodiments, fetal fraction can be determined based on allelic ratios of polymorphic sequences (e.g., single nucleotide polymorphisms (SNPs)), such as, for example, using a method described in U.S. Patent Application Publication No. 2011/0224087, which is hereby incorporated by reference. In such a method, nucleotide sequence reads are obtained for a maternal sample and fetal fraction is determined by comparing the total number of nucleotide sequence reads that map to a first allele and the total number of nucleotide sequence reads that map to a second allele at an informative polymorphic site (e.g., SNP) in a reference genome. In certain embodiments, fetal alleles are identified, for example, by their relative minor contribution to the mixture of fetal and maternal nucleic acids in the sample when compared to the major contribution to the mixture by the maternal nucleic acids. Accordingly, the relative abundance of fetal nucleic acid in a maternal sample can be determined as a parameter of the total number of unique sequence reads mapped to a target nucleic acid sequence on a reference genome for each of the two alleles of a polymorphic site.
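One simple way to turn allele read counts into a fetal fraction estimate is sketched below in Python. The sketch assumes, for illustration only, that every site supplied is informative in the sense that the mother is homozygous and the fetus is heterozygous, so that the minor allele is contributed solely by the fetus and represents half of the fetal reads; under that assumption the fetal fraction is approximately twice the minor-allele read fraction. This simplification, the function name fetal_fraction_from_snps and the read counts are assumptions and are not taken from the publication cited above.

```python
from typing import Iterable, Tuple


def fetal_fraction_from_snps(sites: Iterable[Tuple[int, int]]) -> float:
    """Estimate fetal fraction from (first_allele, second_allele) read counts.

    Assumes each site is informative: mother homozygous, fetus heterozygous.
    The minor allele then carries about half of the fetal reads, so
    fetal fraction ~= 2 * minor reads / total reads, pooled over sites.
    """
    minor_total = 0
    depth_total = 0
    for first, second in sites:
        minor_total += min(first, second)  # fetal-specific allele reads
        depth_total += first + second      # all reads covering the site
    if depth_total == 0:
        raise ValueError("no reads cover the supplied sites")
    return 2.0 * minor_total / depth_total


# Hypothetical read counts at three informative SNPs.
snp_counts = [(950, 50), (40, 960), (1900, 100)]
estimate = fetal_fraction_from_snps(snp_counts)
print(f"estimated fetal fraction: {estimate:.3f}")
```

Pooling counts across sites weights each site by its read depth; averaging per-site ratios instead would be an equally reasonable design choice and would weight all sites equally.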

Fetal fraction can be determined, in some embodiments, using methods that incorporate information derived from maternal chromosomal aberrations as described, for example, in International Application Publication No. WO2014/055774, which is incorporated by reference herein. Fetal fraction can be determined, in some embodiments, using methods that incorporate information derived from sex chromosomes.

Fetal fraction can be determined, in some embodiments, using methods that incorporate fragment length information (e.g., fragment length ratio (FLR) analysis, fetal ratio statistic (FRS) analysis as described in International Application Publication No. WO2013/177086, which is incorporated by reference herein). Cell-free fetal nucleic acid fragments generally are shorter than maternally-derived nucleic acid fragments (see e.g., Chan et al. (2004) Clin. Chem. 50:88-92; Lo et al. (2010) Sci. Transl. Med. 2:61ra91). Thus, fetal fraction can be determined, in some embodiments, by counting fragments under a particular length threshold and comparing the counts, for example, to counts from fragments over a particular length threshold and/or to the amount of total nucleic acid in the sample. Methods for counting nucleic acid fragments of a particular length are described in further detail in International Application Publication No. WO2013/177086.
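The counting comparison described above can be sketched in Python as follows. The 150 bp threshold, the function name short_fragment_ratios and the fragment lengths are illustrative assumptions. The resulting ratios are quantities that may correlate with fetal fraction; they are not themselves calibrated fetal fraction values.

```python
from typing import Dict, Sequence


def short_fragment_ratios(lengths: Sequence[int], threshold_bp: int = 150) -> Dict[str, float]:
    """Count fragments under a length threshold and compare the count to
    (i) fragments at or over the threshold and (ii) all fragments."""
    total = len(lengths)
    if total == 0:
        raise ValueError("no fragment lengths supplied")
    short = sum(1 for n in lengths if n < threshold_bp)
    long_ = total - short
    return {
        "short": short,
        "long": long_,
        "short_over_total": short / total,
        # Undefined when there are no long fragments; report infinity.
        "short_over_long": short / long_ if long_ else float("inf"),
    }


# Hypothetical fragment lengths (bp) from one sample.
fragment_lengths = [120, 135, 143, 149, 150, 166, 167, 170, 180, 210]
ratios = short_fragment_ratios(fragment_lengths, threshold_bp=150)
print(ratios)
```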

Fetal fraction can be determined, in some embodiments, according to portion-specific fetal fraction estimates. Portion-specific fetal fraction also may be referred to as bin-wise fetal fraction (BFF), enet, or sequence-based fetal fraction (SeqFF). Without being limited to theory, reads from fetal CCF fragments (e.g., fragments of a particular length, or range of lengths) often map with varying frequencies to portions (e.g., within the same sample, e.g., within the same sequencing run). Also, without being limited to theory, certain portions, when compared among multiple samples, tend to have a similar representation of reads from fetal CCF fragments (e.g., fragments of a particular length, or range of lengths), and the representation correlates with portion-specific fetal fractions (e.g., the relative amount, percentage or ratio of CCF fragments originating from a fetus).

In some embodiments portion-specific fetal fraction estimates aredetermined based in part on portion-specific parameters and theirrelation to fetal fraction. Portion-specific parameters can be anysuitable parameter that is reflective of (e.g., correlates with) theamount or proportion of reads from CCF fragment lengths of a particularsize (e.g., size range) in a portion. A portion-specific parameter canbe an average, mean or median of portion-specific parameters determinedfor multiple samples. Any suitable portion-specific parameter can beused. Non-limiting examples of portion-specific parameters include FLR(e.g., FRS), an amount of reads having a length less than a selectedfragment length, genomic coverage (i.e., coverage), mappability, counts(e.g., counts of sequence reads mapped to the portion, e.g., normalizedcounts, ChAI normalized counts), DNaseI-sensitivity, methylation state,acetylation, histone distribution, guanine-cytosine (GC) content,chromatin structure, the like or combinations thereof. Aportion-specific parameter can be any suitable parameter that correlateswith FLR and/or FRS in a portion-specific manner. In some embodiments,some or all portion-specific parameters are a direct or indirectrepresentation of an FLR for a portion. In some embodiments aportion-specific parameter is not guanine-cytosine (GC) content.

In some embodiments a portion-specific parameter is any suitable value representing, correlated with or proportional to an amount of reads from CCF fragments where the reads mapped to a portion have a length less than a selected fragment length. In certain embodiments, a portion-specific parameter is a representation of the amount of reads derived from relatively short CCF fragments (e.g., about 200 base pairs or less) that map to a portion. CCF fragments having a length less than a selected fragment length often are relatively short CCF fragments, and sometimes a selected fragment length is about 200 base pairs or less (e.g., CCF fragments that are about 190, 180, 170, 160, 150, 140, 130, 120, 110, 100, 90, or 80 bases in length). The length of a CCF fragment or a read derived from a CCF fragment can be determined (e.g., deduced or inferred) by any suitable method (e.g., a sequencing method, a hybridization approach). In some embodiments the length of a CCF fragment is determined (e.g., deduced or inferred) by a read obtained from a paired-end sequencing method. In certain embodiments the length of a CCF fragment template is determined directly from the length of a read derived from the CCF fragment (e.g., single-end read).
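As a hedged illustration of one such portion-specific parameter, the Python sketch below infers fragment length from the mapped coordinates of a read pair and counts, per portion, the fragments shorter than a selected length. Several details are assumptions made for the example: coordinates are 0-based and half-open, portions are fixed-size 50 kb bins on a single chromosome, a fragment is assigned to the portion containing its leftmost base, and the name short_read_counts is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

PORTION_SIZE = 50_000  # assumed fixed-size portions, for illustration only


def short_read_counts(
    pairs: Iterable[Tuple[int, int]],
    max_len: int = 200,
    portion_size: int = PORTION_SIZE,
) -> Counter:
    """Count fragments shorter than max_len in each portion.

    Each pair is (fragment_start, fragment_end) taken from the outer
    mapped coordinates of the two mates (0-based, half-open), so the
    inferred fragment length is fragment_end - fragment_start.
    """
    counts: Counter = Counter()
    for start, end in pairs:
        length = end - start
        if 0 < length < max_len:
            counts[start // portion_size] += 1  # portion index by leftmost base
    return counts


# Hypothetical read pairs: inferred lengths are 166, 143, 250 and 120 bp.
read_pairs = [(10_000, 10_166), (20_000, 20_143), (30_000, 30_250), (60_000, 60_120)]
per_portion = short_read_counts(read_pairs, max_len=200)
print(dict(per_portion))
```

The 250 bp fragment is excluded by the length cutoff, leaving two short fragments in the first portion and one in the second.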

Portion-specific parameters can be weighted or adjusted by one or more weighting factors. In some embodiments weighted or adjusted portion-specific parameters can provide portion-specific fetal fraction estimates for a sample (e.g., a test sample). In some embodiments weighting or adjusting generally converts the counts of a portion (e.g., reads mapped to a portion) or another portion-specific parameter into a portion-specific fetal fraction estimate, and such a conversion sometimes is considered a transformation.

In some embodiments a weighting factor is a coefficient or constant that, in part, describes and/or defines a relation between a fetal fraction (e.g., a fetal fraction determined from multiple samples) and a portion-specific parameter for multiple samples (e.g., a training set). In some embodiments a weighting factor is determined according to a relation for multiple fetal fraction determinations and multiple portion-specific parameters. A relation may be defined by one or more weighting factors and one or more weighting factors may be determined from a relation. In some embodiments a weighting factor (e.g., one or more weighting factors) is determined from a fitted relation for a portion according to (i) a fraction of fetal nucleic acid determined for each of multiple samples, and (ii) a portion-specific parameter for multiple samples.

A weighting factor can be any suitable coefficient, estimated coefficient or constant derived from a suitable relation (e.g., a suitable mathematical relation, an algebraic relation, a fitted relation, a regression, a regression analysis, a regression model). A weighting factor can be determined according to, derived from, or estimated from a suitable relation. In some embodiments weighting factors are estimated coefficients from a fitted relation. Fitting a relation for multiple samples is sometimes referred to as training a model. Any suitable model and/or method of fitting a relationship (e.g., training a model to a training set) can be used. Non-limiting examples of a suitable model that can be used include a regression model, linear regression model, simple regression model, ordinary least squares regression model, multiple regression model, general multiple regression model, polynomial regression model, general linear model, generalized linear model, discrete choice regression model, logistic regression model, multinomial logit model, mixed logit model, probit model, multinomial probit model, ordered logit model, ordered probit model, Poisson model, multivariate response regression model, multilevel model, fixed effects model, random effects model, mixed model, nonlinear regression model, nonparametric model, semiparametric model, robust model, quantile model, isotonic model, principal components model, least angle model, local model, segmented model, and errors-in-variables model. In some embodiments a fitted relation is not a regression model. In some embodiments a fitted relation is chosen from a decision tree model, support-vector machine model and neural network model. The result of training a model (e.g., a regression model, a relation) is often a relation that can be described mathematically where the relation comprises one or more coefficients (e.g., weighting factors). More complex multivariate models may determine one, two, three or more weighting factors. 
In some embodiments a model is trained according to fetal fraction and two or more portion-specific parameters (e.g., coefficients) obtained from multiple samples (e.g., fitted relationships fitted to multiple samples, e.g., by a matrix).
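To make the notion of a weighting factor derived from a fitted relation concrete, the following Python sketch fits an ordinary least squares line, for a single portion, between the fetal fraction of each training sample and a portion-specific parameter measured in that sample. Ordinary least squares is only one of the models listed above and is chosen here for brevity. The direction of the fit (parameter as a function of fetal fraction), the function name fit_line and the training values are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training set for one portion: sample fetal fractions and
# the FLR observed for this portion in each sample.
training_fetal_fraction = [0.05, 0.10, 0.15, 0.20]
training_flr = [0.325, 0.350, 0.375, 0.400]

slope, intercept = fit_line(training_fetal_fraction, training_flr)
print(f"weighting factors for this portion: slope={slope:.3f}, intercept={intercept:.3f}")
```

Here the slope and intercept play the role of weighting factors for the portion. Repeating the fit for every portion yields one set of coefficients per portion; a multivariate or regularized model would instead fit all portions jointly.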

A weighting factor can be derived from a suitable relation (e.g., a suitable mathematical relation, an algebraic relation, a fitted relation, a regression, a regression analysis, a regression model) by a suitable method. In some embodiments fitted relations are fitted by an estimation, non-limiting examples of which include least squares, ordinary least squares, linear, partial, total, generalized, weighted, non-linear, iteratively reweighted, ridge regression, least absolute deviations, Bayesian, Bayesian multivariate, reduced-rank, LASSO, Weighted Rank Selection Criteria (WRSC), Rank Selection Criteria (RSC), an elastic net estimator (e.g., an elastic net regression) and combinations thereof.

A weighting factor can be determined for or associated with any suitableportion of a genome. A weighting factor can be determined for orassociated with any suitable portion of any suitable chromosome. In someembodiments a weighting factor is determined for or associated with someor all portions in a genome. In some embodiments a weighting factor isdetermined for or associated with portions of some or all chromosomes ina genome. A weighting factor is sometimes determined for or associatedwith portions of selected chromosomes. A weighting factor can bedetermined for or associated with portions of one or more autosomes. Aweighting factor can be determined for or associated with portions in aplurality of portions that include portions in autosomes or a subsetthereof. In some embodiments a weighting factor is determined for orassociated with portions of a sex chromosome (e.g. ChrX and/or ChrY). Aweighting factor can be determined for or associated with portions ofone or more autosomes and one or more sex chromosomes. In certainembodiments a weighting factor is determined for or associated withportions in a plurality of portions in all autosomes and chromosomes Xand Y. A weighting factor can be determined for or associated withportions in a plurality of portions that does not include portions in anX and/or Y chromosome. In certain embodiments a weighting factor isdetermined for or associated with portions of a chromosome where thechromosome comprises an aneuploidy (e.g., a whole chromosomeaneuploidy). In certain embodiments a weighting factor is determined foror associated only with portions of a chromosome where the chromosome isnot aneuploid (e.g., a euploid chromosome). A weighting factor can bedetermined for or associated with portions in a plurality of portionsthat does not include portions in chromosomes 13, 18 and/or 21.

In some embodiments a weighting factor is determined for a portion according to one or more samples (e.g., a training set of samples). Weighting factors are often specific to a portion. In some embodiments one or more weighting factors are independently assigned to a portion. In some embodiments a weighting factor is determined according to a relation for a fetal fraction determination (e.g., a sample specific fetal fraction determination) for multiple samples and a portion-specific parameter determined according to multiple samples. Weighting factors are often determined from multiple samples, for example, from about 20 to about 100,000 or more, from about 100 to about 100,000 or more, from about 500 to about 100,000 or more, from about 1000 to about 100,000 or more, or from about 10,000 to about 100,000 or more samples. Weighting factors can be determined from samples that are euploid (e.g., samples from subjects comprising a euploid fetus, e.g., samples where no aneuploid chromosome is present). In some embodiments weighting factors are obtained from samples comprising an aneuploid chromosome (e.g., samples from subjects comprising an aneuploid fetus). In some embodiments weighting factors are determined from multiple samples from subjects having a euploid fetus and from subjects having a trisomy fetus. Weighting factors can be derived from multiple samples where the samples are from subjects having a male fetus and/or a female fetus.

A fetal fraction is often determined for one or more samples of atraining set from which a weighting factor is derived. A fetal fractionfrom which a weighting factor is determined is sometimes a samplespecific fetal fraction determination. A fetal fraction from which aweighting factor is determined can be determined by any suitable methoddescribed herein or known in the art. In some embodiments adetermination of fetal nucleic acid content (e.g., fetal fraction) isperformed using a suitable fetal quantifier assay (FQA) described hereinor known in the art, non-limiting examples of which include fetalfraction determinations according to markers specific to a male fetus,based on allelic ratios of polymorphic sequences, according to one ormore markers specific to fetal nucleic acid and not maternal nucleicacid, by use of methylation-based DNA discrimination (e.g., A. Nygren,et al., (2010) Clinical Chemistry 56(10):1627-1635), by a massspectrometry method and/or a system that uses a competitive PCRapproach, by a method described in U.S. Patent Application PublicationNo. 2010/0105049, which is hereby incorporated by reference, the like orcombinations thereof. Often a fetal fraction is determined, in part,according to a level (e.g., one or more genomic section levels, a levelof a profile) of a Y chromosome. In some embodiments a fetal fraction isdetermined according to a suitable assay of a Y chromosome (e.g., bycomparing the amount of fetal-specific locus (such as the SRY locus onchromosome Y in male pregnancies) to that of a locus on any autosomethat is common to both the mother and the fetus by using quantitativereal-time PCR (e.g., Lo Y M, et al. (1998) Am J Hum Genet 62:768-775.)).

Portion-specific parameters (e.g., for a test sample) can be weighted or adjusted by one or more weighting factors (e.g., weighting factors derived from a training set). For example, a weighting factor can be derived for a portion according to a relation of a portion-specific parameter and a fetal fraction determination for a training set of multiple samples. A portion-specific parameter of a test sample can then be adjusted and/or weighted according to the weighting factor derived from the training set. In some embodiments a portion-specific parameter from which a weighting factor is derived is the same as the portion-specific parameter (e.g., of a test sample) that is adjusted or weighted (e.g., both parameters are an FLR). In certain embodiments, a portion-specific parameter from which a weighting factor is derived is different from the portion-specific parameter (e.g., of a test sample) that is adjusted or weighted. For example, a weighting factor may be determined from a relation between coverage (i.e., a portion-specific parameter) and fetal fraction for a training set of samples, and an FLR (i.e., another portion-specific parameter) for a portion of a test sample can be adjusted according to the weighting factor derived from coverage. Without being limited by theory, a portion-specific parameter (e.g., for a test sample) can sometimes be adjusted and/or weighted by a weighting factor derived from a different portion-specific parameter (e.g., of a training set) due to a relation and/or correlation between each portion-specific parameter and a common portion-specific FLR.

A portion-specific fetal fraction estimate can be determined for a sample (e.g., a test sample) by weighting a portion-specific parameter by a weighting factor determined for that portion. Weighting can comprise adjusting, converting and/or transforming a portion-specific parameter according to a weighting factor by applying any suitable mathematical manipulation, non-limiting examples of which include multiplication, division, addition, subtraction, integration, symbolic computation, algebraic computation, algorithm, trigonometric or geometric function, transformation (e.g., a Fourier transform), the like or combinations thereof. Weighting can comprise adjusting, converting and/or transforming a portion-specific parameter according to a weighting factor by use of a suitable mathematical model.

In some embodiments a fetal fraction is determined for a sample according to one or more portion-specific fetal fraction estimates. In some embodiments a fetal fraction is determined (e.g., estimated) for a sample (e.g., a test sample) according to weighting or adjusting a portion-specific parameter for one or more portions. In certain embodiments a fraction of fetal nucleic acid for a test sample is estimated based on adjusted counts or an adjusted subset of counts. In certain embodiments a fraction of fetal nucleic acid for a test sample is estimated based on an adjusted FLR, an adjusted FRS, adjusted coverage, and/or adjusted mappability for a portion. In some embodiments about 1 to about 500,000, about 100 to about 300,000, about 500 to about 200,000, about 1000 to about 200,000, about 1500 to about 200,000, or about 1500 to about 50,000 portion-specific parameters are weighted or adjusted.

A fetal fraction (e.g., for a test sample) can be determined according to multiple portion-specific fetal fraction estimates (e.g., for the same test sample) by any suitable method. In some embodiments a method for increasing the accuracy of the estimation of a fraction of fetal nucleic acid in a test sample from a pregnant female comprises determining one or more portion-specific fetal fraction estimates where the estimate of fetal fraction for the sample is determined according to the one or more portion-specific fetal fraction estimates. In some embodiments estimating or determining a fraction of fetal nucleic acid for a sample (e.g., a test sample) comprises summing one or more portion-specific fetal fraction estimates. Summing can comprise determining an average, mean, median, AUC, or integral value according to multiple portion-specific fetal fraction estimates.
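As an illustration only, the combination of multiple portion-specific fetal fraction estimates into a single sample-level value might be sketched as follows. The function name `summarize_fetal_fraction`, the `method` argument and the example values are hypothetical and are not taken from the methods described herein; only the mean and median options named above are shown.

```python
from statistics import mean, median


def summarize_fetal_fraction(portion_estimates, method="mean"):
    """Combine portion-specific fetal fraction estimates into one value.

    portion_estimates: list of per-portion estimates (fractions, e.g. 0.10).
    method: "mean" or "median" (hypothetical option names).
    """
    if not portion_estimates:
        raise ValueError("at least one portion-specific estimate is required")
    if method == "mean":
        return mean(portion_estimates)
    if method == "median":
        return median(portion_estimates)
    raise ValueError(f"unsupported method: {method}")


# Hypothetical estimates for four portions of one test sample.
estimates = [0.08, 0.10, 0.12, 0.30]
sample_mean = summarize_fetal_fraction(estimates, "mean")
sample_median = summarize_fetal_fraction(estimates, "median")
```

In this made-up example the median is less affected than the mean by the single high estimate, which is one reason a particular summary might be preferred for a given data set.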

In some embodiments a method for increasing the accuracy of the estimation of a fraction of fetal nucleic acid in a test sample from a pregnant female comprises obtaining counts of sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a test sample from a pregnant female, where at least a subset of the counts obtained are derived from a region of the genome that contributes a greater number of counts derived from fetal nucleic acid relative to total counts from the region than counts of fetal nucleic acid relative to total counts of another region of the genome. In some embodiments an estimate of the fraction of fetal nucleic acid is determined according to a subset of the portions, where the subset of the portions is selected according to portions to which are mapped a greater number of counts derived from fetal nucleic acid than counts of fetal nucleic acid of another portion. In some embodiments the subset of the portions is selected according to portions to which are mapped a greater number of counts derived from fetal nucleic acid, relative to non-fetal nucleic acid, than counts of fetal nucleic acid, relative to non-fetal nucleic acid, of another portion. The counts mapped to all or a subset of the portions can be weighted, thereby providing weighted counts. The weighted counts can be utilized for estimating the fraction of fetal nucleic acid, and the counts can be weighted according to portions to which are mapped a greater number of counts derived from fetal nucleic acid than counts of fetal nucleic acid of another portion. In some embodiments the counts are weighted according to portions to which are mapped a greater number of counts derived from fetal nucleic acid, relative to non-fetal nucleic acid, than counts of fetal nucleic acid, relative to non-fetal nucleic acid, of another portion.

A fetal fraction can be determined for a sample (e.g., a test sample) according to multiple portion-specific fetal fraction estimates for the sample where the portion-specific estimates are from portions of any suitable region or segment of a genome. Portion-specific fetal fraction estimates can be determined for one or more portions of a suitable chromosome (e.g., one or more selected chromosomes, one or more autosomes, a sex chromosome (e.g., ChrX and/or ChrY), an aneuploid chromosome, a euploid chromosome, the like or combinations thereof).

In some embodiments, determining fetal fraction comprises (a) obtaining counts of sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a test sample from a pregnant female; (b) weighting, e.g., using a microprocessor, (i) the counts of the sequence reads mapped to each portion, or (ii) other portion-specific parameter, to a portion-specific fraction of fetal nucleic acid according to a weighting factor independently associated with each portion, thereby providing portion-specific fetal fraction estimates according to the weighting factors, where each of the weighting factors has been determined from a fitted relation for each portion between (i) a fraction of fetal nucleic acid for each of multiple samples, and (ii) counts of sequence reads mapped to each portion, or other portion-specific parameter, for the multiple samples; and (c) estimating a fraction of fetal nucleic acid for the test sample based on the portion-specific fetal fraction estimates.
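Steps (a) through (c) can be sketched, under stated assumptions, as follows. This is a minimal illustration and not the fitted relation actually used: it assumes an ordinary least-squares straight line per portion, with fetal fraction regressed on the portion-specific parameter, and it assumes the final estimate is a simple average. All function names and the training values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("parameter values are constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_weighting_factors(training_params, training_ff):
    """Fit one (intercept, slope) pair per portion.

    training_params[i][j]: parameter (e.g., counts) of portion j in sample i.
    training_ff[i]: fetal fraction of training sample i.
    """
    n_portions = len(training_params[0])
    return [
        fit_line([row[j] for row in training_params], training_ff)
        for j in range(n_portions)
    ]


def estimate_fetal_fraction(test_params, factors):
    """Return portion-specific estimates and their average for a test sample."""
    per_portion = [a + b * x for x, (a, b) in zip(test_params, factors)]
    return per_portion, sum(per_portion) / len(per_portion)


# Hypothetical training set: three samples, two portions.
training_ff = [0.05, 0.10, 0.20]
training_params = [[110, 45], [120, 40], [140, 30]]
factors = fit_weighting_factors(training_params, training_ff)

# Hypothetical test sample with one parameter value per portion.
per_portion, sample_ff = estimate_fetal_fraction([130, 35], factors)
```

The per-portion fit keeps each weighting factor independent of the others, so portions can be added to or removed from the final average without refitting the remainder.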

The amount of fetal nucleic acid in extracellular nucleic acid can be quantified and used in conjunction with a method provided herein. Thus, in certain embodiments, methods of the technology described herein comprise an additional step of determining the amount of fetal nucleic acid. The amount of fetal nucleic acid can be determined in a nucleic acid sample from a subject before or after processing to prepare sample nucleic acid. In certain embodiments, the amount of fetal nucleic acid is determined in a sample after sample nucleic acid is processed and prepared, which amount is utilized for further assessment. In some embodiments, an outcome comprises factoring the fraction of fetal nucleic acid in the sample nucleic acid (e.g., adjusting counts, removing samples, making a call or not making a call).

The determination step can be performed before, during, at any one point in a method described herein, or after certain (e.g., aneuploidy detection, microduplication or microdeletion detection, fetal gender determination) methods described herein. For example, to achieve a fetal gender or aneuploidy, microduplication or microdeletion determination method with a given sensitivity or specificity, a fetal nucleic acid quantification method may be implemented prior to, during or after fetal gender or aneuploidy, microduplication or microdeletion determination to identify those samples with greater than about 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23%, 24%, 25% or more fetal nucleic acid. In some embodiments, samples determined as having a certain threshold amount of fetal nucleic acid (e.g., about 15% or more fetal nucleic acid; about 4% or more fetal nucleic acid) are further analyzed for fetal gender or aneuploidy, microduplication or microdeletion determination, or the presence or absence of aneuploidy or genetic variation, for example. In certain embodiments, determinations of, for example, fetal gender or the presence or absence of aneuploidy, microduplication or microdeletion are selected (e.g., selected and communicated to a patient) only for samples having a certain threshold amount of fetal nucleic acid (e.g., about 15% or more fetal nucleic acid; about 4% or more fetal nucleic acid).
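The threshold-based selection of samples can be illustrated with a short sketch. The function name `select_reportable`, the sample identifiers and the fetal fraction values are hypothetical; the default threshold of 0.04 simply mirrors the "about 4% or more" example above and is not a recommended cutoff.

```python
def select_reportable(samples, threshold=0.04):
    """Return identifiers of samples whose fetal fraction meets the threshold.

    samples: dict mapping a sample identifier to its fetal fraction (0 to 1).
    threshold: minimum fetal fraction required for further analysis.
    """
    return [sample_id for sample_id, ff in samples.items() if ff >= threshold]


# Hypothetical fetal fraction determinations for three samples.
samples = {"S1": 0.02, "S2": 0.04, "S3": 0.15}
at_four_percent = select_reportable(samples)            # default threshold
at_fifteen_percent = select_reportable(samples, 0.15)   # stricter threshold
```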

In some embodiments, the determination of fetal fraction or determining the amount of fetal nucleic acid is not required or necessary for identifying the presence or absence of a chromosome aneuploidy, microduplication or microdeletion. In some embodiments, identifying the presence or absence of a chromosome aneuploidy, microduplication or microdeletion does not require the sequence differentiation of fetal versus maternal DNA. In certain embodiments this is because the summed contribution of both maternal and fetal sequences in a particular chromosome, chromosome portion or segment thereof is analyzed. In some embodiments, identifying the presence or absence of a chromosome aneuploidy, microduplication or microdeletion does not rely on a priori sequence information that would distinguish fetal DNA from maternal DNA.

Enriching Nucleic Acids

In some embodiments, nucleic acid (e.g., extracellular nucleic acid) is enriched or relatively enriched for a subpopulation or species of nucleic acid. Nucleic acid subpopulations can include, for example, fetal nucleic acid, maternal nucleic acid, nucleic acid comprising fragments of a particular length or range of lengths, or nucleic acid from a particular genome region (e.g., single chromosome, set of chromosomes, and/or certain chromosome regions). Such enriched samples can be used in conjunction with a method provided herein. Thus, in certain embodiments, methods of the technology comprise an additional step of enriching for a subpopulation of nucleic acid in a sample, such as, for example, fetal nucleic acid. In certain embodiments, a method for determining fetal fraction described above also can be used to enrich for fetal nucleic acid. In certain embodiments, maternal nucleic acid is selectively removed (partially, substantially, almost completely or completely) from the sample. In certain embodiments, enriching for a particular low copy number species nucleic acid (e.g., fetal nucleic acid) may improve quantitative sensitivity. Methods for enriching a sample for a particular species of nucleic acid are described, for example, in U.S. Pat. No. 6,927,028, International Patent Application Publication No. WO2007/140417, International Patent Application Publication No. WO2007/147063, International Patent Application Publication No. WO2009/032779, International Patent Application Publication No. WO2009/032781, International Patent Application Publication No. WO2010/033639, International Patent Application Publication No. WO2011/034631, International Patent Application Publication No. WO2006/056480, and International Patent Application Publication No. WO2011/143659, all of which are incorporated by reference herein.

In some embodiments, nucleic acid is enriched for certain target fragment species and/or reference fragment species. In certain embodiments, nucleic acid is enriched for a specific nucleic acid fragment length or range of fragment lengths using one or more length-based separation methods described below. In certain embodiments, nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome) using one or more sequence-based separation methods described herein and/or known in the art. Certain methods for enriching for a nucleic acid subpopulation (e.g., fetal nucleic acid) in a sample are described in detail below.

Some methods for enriching for a nucleic acid subpopulation (e.g., fetal nucleic acid) that can be used with a method described herein include methods that exploit epigenetic differences between maternal and fetal nucleic acid. For example, fetal nucleic acid can be differentiated and separated from maternal nucleic acid based on methylation differences. Methylation-based fetal nucleic acid enrichment methods are described in U.S. Patent Application Publication No. 2010/0105049, which is incorporated by reference herein. Such methods sometimes involve binding a sample nucleic acid to a methylation-specific binding agent (methyl-CpG binding protein (MBD), methylation specific antibodies, and the like) and separating bound nucleic acid from unbound nucleic acid based on differential methylation status. Such methods also can include the use of methylation-sensitive restriction enzymes (as described above; e.g., HhaI and HpaII), which allow for the enrichment of fetal nucleic acid regions in a maternal sample by selectively digesting nucleic acid from the maternal sample with an enzyme that selectively and completely or substantially digests the maternal nucleic acid to enrich the sample for at least one fetal nucleic acid region.

Another method for enriching for a nucleic acid subpopulation (e.g., fetal nucleic acid) that can be used with a method described herein is a restriction endonuclease enhanced polymorphic sequence approach, such as a method described in U.S. Patent Application Publication No. 2009/0317818, which is incorporated by reference herein. Such methods include cleavage of nucleic acid comprising a non-target allele with a restriction endonuclease that recognizes the nucleic acid comprising the non-target allele but not the target allele; and amplification of uncleaved nucleic acid but not cleaved nucleic acid, where the uncleaved, amplified nucleic acid represents enriched target nucleic acid (e.g., fetal nucleic acid) relative to non-target nucleic acid (e.g., maternal nucleic acid). In certain embodiments, nucleic acid may be selected such that it comprises an allele having a polymorphic site that is susceptible to selective digestion by a cleavage agent, for example.

Some methods for enriching for a nucleic acid subpopulation (e.g., fetal nucleic acid) that can be used with a method described herein include selective enzymatic degradation approaches. Such methods involve protecting target sequences from exonuclease digestion thereby facilitating the elimination in a sample of undesired sequences (e.g., maternal DNA). For example, in one approach, sample nucleic acid is denatured to generate single stranded nucleic acid, single stranded nucleic acid is contacted with at least one target-specific primer pair under suitable annealing conditions, annealed primers are extended by nucleotide polymerization generating double stranded target sequences, and single stranded nucleic acid is digested using a nuclease that digests single stranded (i.e., non-target) nucleic acid. In certain embodiments, the method can be repeated for at least one additional cycle. In certain embodiments, the same target-specific primer pair is used to prime each of the first and second cycles of extension, and in certain embodiments, different target-specific primer pairs are used for the first and second cycles.

Some methods for enriching for a nucleic acid subpopulation (e.g., fetal nucleic acid) that can be used with a method described herein include massively parallel signature sequencing (MPSS) approaches. MPSS typically is a solid phase method that uses adapter (i.e., tag) ligation, followed by adapter decoding, and reading of the nucleic acid sequence in small increments. Tagged PCR products are typically amplified such that each nucleic acid generates a PCR product with a unique tag. Tags are often used to attach the PCR products to microbeads. After several rounds of ligation-based sequence determination, for example, a sequence signature can be identified from each bead. Each signature sequence (MPSS tag) in an MPSS dataset is analyzed, compared with all other signatures, and all identical signatures are counted.
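The final counting step, in which identical signatures are tallied, might be illustrated as below. This sketch covers only the tallying of already-decoded signatures, not the bead-based chemistry; the function name `count_signatures` and the example sequences are hypothetical.

```python
from collections import Counter


def count_signatures(signatures):
    """Tally how many times each identical signature sequence occurs."""
    return Counter(signatures)


# Hypothetical decoded signatures, one per bead.
signatures = ["GATCAAGT", "GATCTTGA", "GATCAAGT", "GATCCCAT", "GATCAAGT"]
signature_counts = count_signatures(signatures)
most_common_signature, most_common_count = signature_counts.most_common(1)[0]
```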

In certain embodiments, certain enrichment methods (e.g., certain MPS and/or MPSS-based enrichment methods) can include amplification (e.g., PCR)-based approaches. In certain embodiments, loci-specific amplification methods can be used (e.g., using loci-specific amplification primers). In certain embodiments, a multiplex SNP allele PCR approach can be used. In certain embodiments, a multiplex SNP allele PCR approach can be used in combination with uniplex sequencing. For example, such an approach can involve the use of multiplex PCR (e.g., MASSARRAY system) and incorporation of capture probe sequences into the amplicons followed by sequencing using, for example, the Illumina MPSS system. In certain embodiments, a multiplex SNP allele PCR approach can be used in combination with a three-primer system and indexed sequencing. For example, such an approach can involve the use of multiplex PCR (e.g., MASSARRAY system) with primers having a first capture probe incorporated into certain loci-specific forward PCR primers and adapter sequences incorporated into loci-specific reverse PCR primers, to thereby generate amplicons, followed by a secondary PCR to incorporate reverse capture sequences and molecular index barcodes for sequencing using, for example, the Illumina MPSS system. In certain embodiments, a multiplex SNP allele PCR approach can be used in combination with a four-primer system and indexed sequencing. For example, such an approach can involve the use of multiplex PCR (e.g., MASSARRAY system) with primers having adaptor sequences incorporated into both loci-specific forward and loci-specific reverse PCR primers, followed by a secondary PCR to incorporate both forward and reverse capture sequences and molecular index barcodes for sequencing using, for example, the Illumina MPSS system. In certain embodiments, a microfluidics approach can be used. In certain embodiments, an array-based microfluidics approach can be used. For example, such an approach can involve the use of a microfluidics array (e.g., Fluidigm) for amplification at low plex and incorporation of index and capture probes, followed by sequencing. In certain embodiments, an emulsion microfluidics approach can be used, such as, for example, digital droplet PCR.

In certain embodiments, universal amplification methods can be used (e.g., using universal or non-loci-specific amplification primers). In certain embodiments, universal amplification methods can be used in combination with pull-down approaches. In certain embodiments, a method can include biotinylated ultramer pull-down (e.g., biotinylated pull-down assays from Agilent or IDT) from a universally amplified sequencing library. For example, such an approach can involve preparation of a standard library, enrichment for selected regions by a pull-down assay, and a secondary universal amplification step. In certain embodiments, pull-down approaches can be used in combination with ligation-based methods. In certain embodiments, a method can include biotinylated ultramer pull-down with sequence specific adapter ligation (e.g., HALOPLEX PCR, Halo Genomics). For example, such an approach can involve the use of selector probes to capture restriction enzyme-digested fragments, followed by ligation of captured products to an adaptor, and universal amplification followed by sequencing. In certain embodiments, pull-down approaches can be used in combination with extension and ligation-based methods. In certain embodiments, a method can include molecular inversion probe (MIP) extension and ligation. For example, such an approach can involve the use of molecular inversion probes in combination with sequence adapters followed by universal amplification and sequencing. In certain embodiments, complementary DNA can be synthesized and sequenced without amplification.

In certain embodiments, extension and ligation approaches can be performed without a pull-down component. In certain embodiments, a method can include loci-specific forward and reverse primer hybridization, extension and ligation. Such methods can further include universal amplification or complementary DNA synthesis without amplification, followed by sequencing. Such methods can reduce or exclude background sequences during analysis, in certain embodiments.

In certain embodiments, pull-down approaches can be used with an optional amplification component or with no amplification component. In certain embodiments, a method can include a modified pull-down assay and ligation with full incorporation of capture probes without universal amplification. For example, such an approach can involve the use of modified selector probes to capture restriction enzyme-digested fragments, followed by ligation of captured products to an adaptor, optional amplification, and sequencing. In certain embodiments, a method can include a biotinylated pull-down assay with extension and ligation of adaptor sequence in combination with circular single stranded ligation. For example, such an approach can involve the use of selector probes to capture regions of interest (i.e., target sequences), extension of the probes, adaptor ligation, single stranded circular ligation, optional amplification, and sequencing. In certain embodiments, the analysis of the sequencing result can separate target sequences from background.

In some embodiments, nucleic acid is enriched for fragments from a select genomic region (e.g., chromosome) using one or more sequence-based separation methods described herein. Sequence-based separation generally is based on nucleotide sequences present in the fragments of interest (e.g., target and/or reference fragments) and substantially not present in other fragments of the sample or present in an insubstantial amount of the other fragments (e.g., 5% or less). In some embodiments, sequence-based separation can generate separated target fragments and/or separated reference fragments. Separated target fragments and/or separated reference fragments often are isolated away from the remaining fragments in the nucleic acid sample. In certain embodiments, the separated target fragments and the separated reference fragments also are isolated away from each other (e.g., isolated in separate assay compartments). In certain embodiments, the separated target fragments and the separated reference fragments are isolated together (e.g., isolated in the same assay compartment). In some embodiments, unbound fragments can be differentially removed or degraded or digested.

In some embodiments, a selective nucleic acid capture process is used to separate target and/or reference fragments away from the nucleic acid sample. Commercially available nucleic acid capture systems include, for example, Nimblegen sequence capture system (Roche NimbleGen, Madison, Wis.); Illumina BEADARRAY platform (Illumina, San Diego, Calif.); Affymetrix GENECHIP platform (Affymetrix, Santa Clara, Calif.); Agilent SureSelect Target Enrichment System (Agilent Technologies, Santa Clara, Calif.); and related platforms. Such methods typically involve hybridization of a capture oligonucleotide to a segment or all of the nucleotide sequence of a target or reference fragment and can include use of a solid phase (e.g., solid phase array) and/or a solution based platform. Capture oligonucleotides (sometimes referred to as “bait”) can be selected or designed such that they preferentially hybridize to nucleic acid fragments from selected genomic regions or loci (e.g., one of chromosomes 21, 18, 13, X or Y, or a reference chromosome). In certain embodiments, a hybridization-based method (e.g., using oligonucleotide arrays) can be used to enrich for nucleic acid sequences from certain chromosomes (e.g., a potentially aneuploid chromosome, reference chromosome or other chromosome of interest) or segments of interest thereof.

In some embodiments, nucleic acid is enriched for a particular nucleic acid fragment length, range of lengths, or lengths under or over a particular threshold or cutoff using one or more length-based separation methods. Nucleic acid fragment length typically refers to the number of nucleotides in the fragment. Nucleic acid fragment length also is sometimes referred to as nucleic acid fragment size. In some embodiments, a length-based separation method is performed without measuring lengths of individual fragments. In some embodiments, a length-based separation method is performed in conjunction with a method for determining length of individual fragments. In some embodiments, length-based separation refers to a size fractionation procedure where all or part of the fractionated pool can be isolated (e.g., retained) and/or analyzed. Size fractionation procedures are known in the art (e.g., separation on an array, separation by a molecular sieve, separation by gel electrophoresis, separation by column chromatography (e.g., size-exclusion columns), and microfluidics-based approaches). In certain embodiments, length-based separation approaches can include fragment circularization, chemical treatment (e.g., formaldehyde, polyethylene glycol (PEG)), mass spectrometry and/or size-specific nucleic acid amplification, for example.

In some embodiments, nucleic acid fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are separated from the sample. In some embodiments, fragments having a length under a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “short” fragments and fragments having a length over a particular threshold or cutoff (e.g., 500 bp, 400 bp, 300 bp, 200 bp, 150 bp, 100 bp) are referred to as “long” fragments. In some embodiments, fragments of a certain length, range of lengths, or lengths under or over a particular threshold or cutoff are retained for analysis while fragments of a different length or range of lengths, or lengths over or under the threshold or cutoff are not retained for analysis. In some embodiments, fragments that are less than about 500 bp are retained. In some embodiments, fragments that are less than about 400 bp are retained. In some embodiments, fragments that are less than about 300 bp are retained. In some embodiments, fragments that are less than about 200 bp are retained. In some embodiments, fragments that are less than about 150 bp are retained. For example, fragments that are less than about 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp, 110 bp or 100 bp are retained. In some embodiments, fragments that are about 100 bp to about 200 bp are retained. For example, fragments that are about 190 bp, 180 bp, 170 bp, 160 bp, 150 bp, 140 bp, 130 bp, 120 bp or 110 bp are retained. In some embodiments, fragments that are in the range of about 100 bp to about 200 bp are retained. For example, fragments that are in the range of about 110 bp to about 190 bp, 130 bp to about 180 bp, 140 bp to about 170 bp, 140 bp to about 150 bp, 150 bp to about 160 bp, or 145 bp to about 155 bp are retained. In some embodiments, fragments that are about 10 bp to about 30 bp shorter than other fragments of a certain length or range of lengths are retained. In some embodiments, fragments that are about 10 bp to about 20 bp shorter than other fragments of a certain length or range of lengths are retained. In some embodiments, fragments that are about 10 bp to about 15 bp shorter than other fragments of a certain length or range of lengths are retained.

In some embodiments, nucleic acid is enriched for a particular nucleic acid fragment length, range of lengths, or lengths under or over a particular threshold or cutoff using one or more bioinformatics-based (e.g., in silico) methods. For example, nucleotide sequence reads can be obtained for nucleic acid fragments using a suitable nucleotide sequencing process. In some instances, such as when a paired-end sequencing method is used, the length of a particular fragment can be determined based on the positions of mapped sequence reads obtained from each terminus of the fragment. Sequence reads used for a particular analysis (e.g., determining the presence or absence of a genetic variation) can be enriched or filtered according to one or more selected fragment lengths or fragment length threshold values of corresponding fragments, as described herein.
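An in silico length filter of this kind might be sketched as follows. The sketch assumes each read pair has already been mapped and reduced to two hypothetical 1-based inclusive coordinates on the same chromosome: the leftmost mapped base of one mate and the rightmost mapped base of the other. The function names, tuple layout and example coordinates are illustrative assumptions rather than a format defined herein.

```python
def fragment_length(leftmost_start, rightmost_end):
    """Fragment length implied by the outer mapped positions of a read pair."""
    if rightmost_end < leftmost_start:
        raise ValueError("rightmost_end must not precede leftmost_start")
    return rightmost_end - leftmost_start + 1


def filter_pairs_by_length(pairs, min_len=None, max_len=None):
    """Keep read pairs whose inferred fragment length is within the bounds.

    pairs: iterable of (name, leftmost_start, rightmost_end) tuples.
    min_len, max_len: optional inclusive bounds in base pairs.
    Returns a list of (name, length) tuples for retained pairs.
    """
    kept = []
    for name, start, end in pairs:
        length = fragment_length(start, end)
        if min_len is not None and length < min_len:
            continue
        if max_len is not None and length > max_len:
            continue
        kept.append((name, length))
    return kept


# Hypothetical mapped read pairs with lengths of 145, 166 and 300 bp.
pairs = [("r1", 1000, 1144), ("r2", 5000, 5165), ("r3", 9000, 9299)]
short_only = filter_pairs_by_length(pairs, max_len=150)
in_range = filter_pairs_by_length(pairs, min_len=100, max_len=200)
```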

Certain length-based separation methods that can be used with methods described herein employ a selective sequence tagging approach, for example. The term “sequence tagging” refers to incorporating a recognizable and distinct sequence into a nucleic acid or population of nucleic acids. The term “sequence tagging” as used herein has a different meaning than the term “sequence tag” described later herein. In such sequence tagging methods, nucleic acids of a fragment size species (e.g., short fragments) are subjected to selective sequence tagging in a sample that includes long and short nucleic acids. Such methods typically involve performing a nucleic acid amplification reaction using a set of nested primers which include inner primers and outer primers. In certain embodiments, one or both of the inner primers can be tagged to thereby introduce a tag onto the target amplification product. The outer primers generally do not anneal to the short fragments that carry the (inner) target sequence. The inner primers can anneal to the short fragments and generate an amplification product that carries a tag and the target sequence. Typically, tagging of the long fragments is inhibited through a combination of mechanisms which include, for example, blocked extension of the inner primers by the prior annealing and extension of the outer primers. Enrichment for tagged fragments can be accomplished by any of a variety of methods, including, for example, exonuclease digestion of single stranded nucleic acid and amplification of the tagged fragments using amplification primers specific for at least one tag.

Another length-based separation method that can be used with methods described herein involves subjecting a nucleic acid sample to polyethylene glycol (PEG) precipitation. Examples of methods include those described in International Patent Application Publication Nos. WO2007/140417 and WO2010/115016. This method in general entails contacting a nucleic acid sample with PEG in the presence of one or more monovalent salts under conditions sufficient to substantially precipitate large nucleic acids without substantially precipitating small (e.g., less than 300 nucleotides) nucleic acids.

Another size-based enrichment method that can be used with methods described herein involves circularization by ligation, for example, using circligase. Short nucleic acid fragments typically can be circularized with higher efficiency than long fragments. Non-circularized sequences can be separated from circularized sequences, and the enriched short fragments can be used for further analysis.

Nucleic Acid Library

In some embodiments a nucleic acid library is a plurality of polynucleotide molecules (e.g., a sample of nucleic acids) that are prepared, assembled and/or modified for a specific process, non-limiting examples of which include immobilization on a solid phase (e.g., a solid support, e.g., a flow cell, a bead), enrichment, amplification, cloning, detection and/or for nucleic acid sequencing. In certain embodiments, a nucleic acid library is prepared prior to or during a sequencing process. A nucleic acid library (e.g., sequencing library) can be prepared by a suitable method as known in the art. A nucleic acid library can be prepared by a targeted or a non-targeted preparation process.

In some embodiments a library of nucleic acids is modified to comprise a chemical moiety (e.g., a functional group) configured for immobilization of nucleic acids to a solid support. In some embodiments a library of nucleic acids is modified to comprise a biomolecule (e.g., a functional group) and/or member of a binding pair configured for immobilization of the library to a solid support, non-limiting examples of which include thyroxin-binding globulin, steroid-binding proteins, antibodies, antigens, haptens, enzymes, lectins, nucleic acids, repressors, protein A, protein G, avidin, streptavidin, biotin, complement component C1q, nucleic acid-binding proteins, receptors, carbohydrates, oligonucleotides, polynucleotides, complementary nucleic acid sequences, the like and combinations thereof. Some examples of specific binding pairs include, without limitation: an avidin moiety and a biotin moiety; an antigenic epitope and an antibody or immunologically reactive fragment thereof; an antibody and a hapten; a digoxigenin moiety and an anti-digoxigenin antibody; a fluorescein moiety and an anti-fluorescein antibody; an operator and a repressor; a nuclease and a nucleotide; a lectin and a polysaccharide; a steroid and a steroid-binding protein; an active compound and an active compound receptor; a hormone and a hormone receptor; an enzyme and a substrate; an immunoglobulin and protein A; an oligonucleotide or polynucleotide and its corresponding complement; the like or combinations thereof.

In some embodiments a library of nucleic acids is modified to comprise one or more polynucleotides of known composition, non-limiting examples of which include an identifier (e.g., a tag, an indexing tag), a capture sequence, a label, an adapter, a restriction enzyme site, a promoter, an enhancer, an origin of replication, a stem loop, a complementary sequence (e.g., a primer binding site, an annealing site), a suitable integration site (e.g., a transposon, a viral integration site), a modified nucleotide, the like or combinations thereof. Polynucleotides of known sequence can be added at a suitable position, for example on the 5′ end, 3′ end or within a nucleic acid sequence. Polynucleotides of known sequence can be the same or different sequences. In some embodiments a polynucleotide of known sequence is configured to hybridize to one or more oligonucleotides immobilized on a surface (e.g., a surface in a flow cell). For example, a nucleic acid molecule comprising a 5′ known sequence may hybridize to a first plurality of oligonucleotides while the 3′ known sequence may hybridize to a second plurality of oligonucleotides. In some embodiments a library of nucleic acids can comprise chromosome-specific tags, capture sequences, labels and/or adaptors. In some embodiments, a library of nucleic acids comprises one or more detectable labels. In some embodiments one or more detectable labels may be incorporated into a nucleic acid library at a 5′ end, at a 3′ end, and/or at any nucleotide position within a nucleic acid in the library. In some embodiments a library of nucleic acids comprises hybridized oligonucleotides. In certain embodiments hybridized oligonucleotides are labeled probes. In some embodiments a library of nucleic acids comprises hybridized oligonucleotide probes prior to immobilization on a solid phase.

In some embodiments a polynucleotide of known sequence comprises a universal sequence. A universal sequence is a specific nucleic acid sequence that is integrated into two or more nucleic acid molecules or two or more subsets of nucleic acid molecules where the universal sequence is the same for all molecules or subsets of molecules that it is integrated into. A universal sequence is often designed to hybridize to and/or amplify a plurality of different sequences using a single universal primer that is complementary to a universal sequence. In some embodiments two (e.g., a pair) or more universal sequences and/or universal primers are used. A universal primer often comprises a universal sequence. In some embodiments adapters (e.g., universal adapters) comprise universal sequences. In some embodiments one or more universal sequences are used to capture, identify and/or detect multiple species or subsets of nucleic acids.

In certain embodiments of preparing a nucleic acid library (e.g., in certain sequencing by synthesis procedures), nucleic acids are size selected and/or fragmented into lengths of several hundred base pairs, or less (e.g., in preparation for library generation). In some embodiments, library preparation is performed without fragmentation (e.g., when using ccfDNA).

In certain embodiments, a ligation-based library preparation method is used (e.g., ILLUMINA TRUSEQ, Illumina, San Diego Calif.). Ligation-based library preparation methods often make use of an adaptor (e.g., a methylated adaptor) design which can incorporate an index sequence at the initial ligation step and often can be used to prepare samples for single-read sequencing, paired-end sequencing and multiplexed sequencing. For example, sometimes nucleic acids (e.g., fragmented nucleic acids or ccfDNA) are end repaired by a fill-in reaction, an exonuclease reaction or a combination thereof. In some embodiments the resulting blunt-end repaired nucleic acid can then be extended by a single nucleotide, which is complementary to a single nucleotide overhang on the 3′ end of an adapter/primer. Any nucleotide can be used for the extension/overhang nucleotides. In some embodiments nucleic acid library preparation comprises ligating an adapter oligonucleotide. Adapter oligonucleotides are often complementary to flow-cell anchors, and sometimes are utilized to immobilize a nucleic acid library to a solid support, such as the inside surface of a flow cell, for example. In some embodiments, an adapter oligonucleotide comprises an identifier, one or more sequencing primer hybridization sites (e.g., sequences complementary to universal sequencing primers, single end sequencing primers, paired end sequencing primers, multiplexed sequencing primers, and the like), or combinations thereof (e.g., adapter/sequencing, adapter/identifier, adapter/identifier/sequencing).

An identifier can be a suitable detectable label incorporated into or attached to a nucleic acid (e.g., a polynucleotide) that allows detection and/or identification of nucleic acids that comprise the identifier. In some embodiments an identifier is incorporated into or attached to a nucleic acid during a sequencing method (e.g., by a polymerase). Non-limiting examples of identifiers include nucleic acid tags, nucleic acid indexes or barcodes, a radiolabel (e.g., an isotope), metallic label, a fluorescent label, a chemiluminescent label, a phosphorescent label, a fluorophore quencher, a dye, a protein (e.g., an enzyme, an antibody or part thereof, a linker, a member of a binding pair), the like or combinations thereof. In some embodiments an identifier (e.g., a nucleic acid index or barcode) is a unique, known and/or identifiable sequence of nucleotides or nucleotide analogues. In some embodiments identifiers are six or more contiguous nucleotides. A multitude of fluorophores are available with a variety of different excitation and emission spectra. Any suitable type and/or number of fluorophores can be used as an identifier. In some embodiments 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 20 or more, 30 or more or 50 or more different identifiers are utilized in a method described herein (e.g., a nucleic acid detection and/or sequencing method). In some embodiments, one or two types of identifiers (e.g., fluorescent labels) are linked to each nucleic acid in a library. Detection and/or quantification of an identifier can be performed by a suitable method or apparatus, non-limiting examples of which include flow cytometry, quantitative polymerase chain reaction (qPCR), gel electrophoresis, a luminometer, a fluorometer, a spectrophotometer, a suitable gene-chip or microarray analysis, Western blot, mass spectrometry, chromatography, cytofluorimetric analysis, fluorescence microscopy, a suitable fluorescence or digital imaging method, confocal laser scanning microscopy, laser scanning cytometry, affinity chromatography, manual batch mode separation, electric field suspension, a suitable nucleic acid sequencing method and/or nucleic acid sequencing apparatus, the like and combinations thereof.

In some embodiments, a transposon-based library preparation method is used (e.g., EPICENTRE NEXTERA, Epicentre, Madison Wis.). Transposon-based methods typically use in vitro transposition to simultaneously fragment and tag DNA in a single-tube reaction (often allowing incorporation of platform-specific tags and optional barcodes), and prepare sequencer-ready libraries.

In some embodiments a nucleic acid library or parts thereof are amplified (e.g., amplified by a PCR-based method). In some embodiments a sequencing method comprises amplification of a nucleic acid library. A nucleic acid library can be amplified prior to or after immobilization on a solid support (e.g., a solid support in a flow cell). Nucleic acid amplification includes the process of amplifying or increasing the numbers of a nucleic acid template and/or of a complement thereof that are present (e.g., in a nucleic acid library), by producing one or more copies of the template and/or its complement. Amplification can be carried out by a suitable method. A nucleic acid library can be amplified by a thermocycling method or by an isothermal amplification method. In some embodiments a rolling circle amplification method is used. In some embodiments amplification takes place on a solid support (e.g., within a flow cell) where a nucleic acid library or portion thereof is immobilized. In certain sequencing methods, a nucleic acid library is added to a flow cell and immobilized by hybridization to anchors under suitable conditions. This type of nucleic acid amplification is often referred to as solid phase amplification. In some embodiments of solid phase amplification, all or a portion of the amplified products are synthesized by an extension initiating from an immobilized primer. Solid phase amplification reactions are analogous to standard solution phase amplifications except that at least one of the amplification oligonucleotides (e.g., primers) is immobilized on a solid support.

In some embodiments solid phase amplification comprises a nucleic acid amplification reaction comprising only one species of oligonucleotide primer immobilized to a surface. In certain embodiments solid phase amplification comprises a plurality of different immobilized oligonucleotide primer species. In some embodiments solid phase amplification may comprise a nucleic acid amplification reaction comprising one species of oligonucleotide primer immobilized on a solid surface and a second different oligonucleotide primer species in solution. Multiple different species of immobilized or solution based primers can be used. Non-limiting examples of solid phase nucleic acid amplification reactions include interfacial amplification, bridge amplification, emulsion PCR, WildFire amplification (e.g., US patent publication US20130012399), the like or combinations thereof.

Sequencing

In some embodiments, nucleic acids (e.g., nucleic acid fragments, sample nucleic acid, cell-free nucleic acid) are sequenced. In certain embodiments, a full or substantially full sequence is obtained and sometimes a partial sequence is obtained.

In some embodiments, fragment length is determined using a sequencing method. In some embodiments, fragment length is determined using a paired-end sequencing platform. Such platforms involve sequencing of both ends of a nucleic acid fragment. Generally, the sequences corresponding to both ends of the fragment can be mapped to a reference genome (e.g., a reference human genome). In certain embodiments, both ends are sequenced at a read length that is sufficient to map, individually for each fragment end, to a reference genome. Examples of paired-end sequence read lengths are described below. In certain embodiments, all or a portion of the sequence reads can be mapped to a reference genome without mismatch. In some embodiments, each read is mapped independently. In some embodiments, information from both sequence reads (i.e., from each end) is factored in the mapping process. The length of a fragment can be determined, for example, by calculating the difference between genomic coordinates assigned to each mapped paired-end read.
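The coordinate-difference calculation described above can be sketched as follows. This is a minimal illustration, not the method of any particular platform: the `MappedRead` type, the 0-based half-open coordinate convention and the example coordinates are all assumptions made for the sketch.

```python
from typing import NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical record for one mapped read of a pair (assumed layout)."""
    chrom: str
    start: int  # 0-based leftmost mapped coordinate (assumed convention)
    end: int    # exclusive rightmost mapped coordinate (assumed convention)


def fragment_length(read1: MappedRead, read2: MappedRead) -> int:
    """Estimate fragment length as the span between the outermost mapped coordinates."""
    if read1.chrom != read2.chrom:
        raise ValueError("paired reads map to different chromosomes")
    leftmost = min(read1.start, read2.start)
    rightmost = max(read1.end, read2.end)
    return rightmost - leftmost


# Two 36-base reads from opposite ends of one fragment (made-up coordinates).
mate1 = MappedRead("chr21", 1000, 1036)
mate2 = MappedRead("chr21", 1130, 1166)
print(fragment_length(mate1, mate2))  # 166
```

Taking the outermost coordinates of the two reads makes the result independent of which read of the pair is listed first.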

In some embodiments, fragment length can be determined using a sequencing process whereby a complete, or substantially complete, nucleotide sequence is obtained for the fragment. Such sequencing processes include platforms that generate relatively long read lengths (e.g., Roche 454, Ion Torrent, Pacific Biosciences single molecule, real-time (SMRT) technology, and the like).

In some embodiments some or all nucleic acids in a sample are enriched and/or amplified (e.g., non-specifically, e.g., by a PCR based method) prior to or during sequencing. In certain embodiments specific nucleic acid portions or subsets in a sample are enriched and/or amplified prior to or during sequencing. In some embodiments, a portion or subset of a pre-selected pool of nucleic acids is sequenced randomly. In some embodiments, nucleic acids in a sample are not enriched and/or amplified prior to or during sequencing.

As used herein, “reads” (i.e., “a read”, “a sequence read”) are short nucleotide sequences produced by any sequencing process described herein or known in the art. Reads can be generated from one end of nucleic acid fragments (“single-end reads”), and sometimes are generated from both ends of nucleic acids (e.g., paired-end reads, double-end reads).

The length of a sequence read is often associated with the particular sequencing technology. High-throughput methods, for example, provide sequence reads that can vary in size from tens to hundreds of base pairs (bp). Nanopore sequencing, for example, can provide sequence reads that can vary in size from tens to hundreds to thousands of base pairs. In some embodiments, sequence reads are of a mean, median, average or absolute length of about 15 bp to about 900 bp long. In certain embodiments sequence reads are of a mean, median, average or absolute length of about 1000 bp or more.

In some embodiments the nominal, average, mean or absolute length of single-end reads sometimes is about 15 contiguous nucleotides to about 50 or more contiguous nucleotides, about 15 contiguous nucleotides to about 40 or more contiguous nucleotides, and sometimes about 15 contiguous nucleotides or about 36 or more contiguous nucleotides. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 20 to about 30 bases, or about 24 to about 28 bases in length. In certain embodiments the nominal, average, mean or absolute length of single-end reads is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 21, 22, 23, 24, 25, 26, 27, 28 or about 29 bases or more in length.

In certain embodiments, the nominal, average, mean or absolute length of paired-end reads sometimes is about 10 contiguous nucleotides to about 25 contiguous nucleotides or more (e.g., about 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length or more), about 15 contiguous nucleotides to about 20 contiguous nucleotides or more, and sometimes is about 17 contiguous nucleotides, about 18 contiguous nucleotides, about 20 contiguous nucleotides, about 25 contiguous nucleotides, about 36 contiguous nucleotides or about 45 contiguous nucleotides.

Reads generally are representations of nucleotide sequences in a physical nucleic acid. For example, in a read containing an ATGC depiction of a sequence, “A” represents an adenine nucleotide, “T” represents a thymine nucleotide, “G” represents a guanine nucleotide and “C” represents a cytosine nucleotide, in a physical nucleic acid. Sequence reads obtained from the blood of a pregnant female can be reads from a mixture of fetal and maternal nucleic acid. A mixture of relatively short reads can be transformed by processes described herein into a representation of a genomic nucleic acid present in the pregnant female and/or in the fetus. A mixture of relatively short reads can be transformed into a representation of a copy number variation (e.g., a maternal and/or fetal copy number variation), genetic variation or an aneuploidy, microduplication or microdeletion, for example. Reads of a mixture of maternal and fetal nucleic acid can be transformed into a representation of a composite chromosome or a segment thereof comprising features of one or both maternal and fetal chromosomes. In certain embodiments, “obtaining” nucleic acid sequence reads of a sample from a subject and/or “obtaining” nucleic acid sequence reads of a biological specimen from one or more reference persons can involve directly sequencing nucleic acid to obtain the sequence information. In some embodiments, “obtaining” can involve receiving sequence information obtained directly from a nucleic acid by another.

In some embodiments, a representative fraction of a genome is sequenced and is sometimes referred to as “coverage” or “fold coverage”. For example, a 1-fold coverage indicates that roughly 100% of the nucleotide sequences of the genome are represented by reads. In some embodiments “fold coverage” is a relative term referring to a prior sequencing run as a reference. For example, a second sequencing run may have 2-fold less coverage than a first sequencing run. In some embodiments a genome is sequenced with redundancy, where a given region of the genome can be covered by two or more reads or overlapping reads (e.g., a “fold coverage” greater than 1, e.g., a 2-fold coverage).
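As a rough illustration of fold coverage, the sketch below uses a common convention that is an assumption here and is not stated in the text: fold coverage is approximated as total sequenced bases divided by genome size. The function name, the read count and the 3,000,000,000-base genome size are hypothetical round figures.

```python
def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Approximate fold coverage as total sequenced bases over genome size.

    This is a simplifying assumption: it ignores unmapped reads, overlapping
    reads and uneven representation across the genome.
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return (num_reads * read_length) / genome_size


# Hypothetical low-coverage run: 12 million 36-base reads, ~3 Gb genome.
first_run = fold_coverage(12_000_000, 36, 3_000_000_000)
# A second run with half as many reads has 2-fold less coverage than the first.
second_run = fold_coverage(6_000_000, 36, 3_000_000_000)
print(first_run, second_run, first_run / second_run)
```

Under this approximation a value below 1 means that only a fraction of the genome can be represented by reads, while values above 1 imply redundancy.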

In some embodiments, one nucleic acid sample from one individual is sequenced. In certain embodiments, nucleic acids from each of two or more samples are sequenced, where samples are from one individual or from different individuals. In certain embodiments, nucleic acid samples from two or more biological samples are pooled, where each biological sample is from one individual or two or more individuals, and the pool is sequenced. In the latter embodiments, a nucleic acid sample from each biological sample often is identified by one or more unique identifiers.

In some embodiments a sequencing method utilizes identifiers that allow multiplexing of sequence reactions in a sequencing process. The greater the number of unique identifiers, the greater the number of samples and/or chromosomes for detection, for example, that can be multiplexed in a sequencing process. A sequencing process can be performed using any suitable number of unique identifiers (e.g., 4, 8, 12, 24, 48, 96, or more).

A sequencing process sometimes makes use of a solid phase, and sometimes the solid phase comprises a flow cell on which nucleic acid from a library can be attached and reagents can be flowed and contacted with the attached nucleic acid. A flow cell sometimes includes flow cell lanes, and use of identifiers can facilitate analyzing a number of samples in each lane. A flow cell often is a solid support that can be configured to retain and/or allow the orderly passage of reagent solutions over bound analytes. Flow cells frequently are planar in shape, optically transparent, generally in the millimeter or sub-millimeter scale, and often have channels or lanes in which the analyte/reagent interaction occurs. In some embodiments the number of samples analyzed in a given flow cell lane is dependent on the number of unique identifiers utilized during library preparation and/or probe design. Multiplexing using 12 identifiers, for example, allows simultaneous analysis of 96 samples (e.g., equal to the number of wells in a 96 well microwell plate) in an 8 lane flow cell. Similarly, multiplexing using 48 identifiers, for example, allows simultaneous analysis of 384 samples (e.g., equal to the number of wells in a 384 well microwell plate) in an 8 lane flow cell. Non-limiting examples of commercially available multiplex sequencing kits include Illumina's multiplexing sample preparation oligonucleotide kit and multiplexing sequencing primers and PhiX control kit (e.g., Illumina's catalog numbers PE-400-1001 and PE-400-1002, respectively).
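The multiplexing arithmetic above (identifiers per lane multiplied by lanes) and the use of identifiers to assign pooled reads back to samples can be sketched as follows. The function names, the six-nucleotide index sequences and the sample labels are hypothetical, and exact matching of the index is an assumption; practical demultiplexing often tolerates index mismatches.

```python
from collections import defaultdict


def multiplex_capacity(n_identifiers: int, n_lanes: int) -> int:
    """Samples analyzable at once when each lane carries one sample per identifier."""
    return n_identifiers * n_lanes


def demultiplex(reads, index_to_sample):
    """Group read sequences by sample using an exact match on the index sequence.

    `reads` is an iterable of (index_sequence, read_sequence) pairs; reads whose
    index is not in `index_to_sample` are discarded in this sketch.
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = index_to_sample.get(index_seq)
        if sample is not None:
            by_sample[sample].append(read_seq)
    return dict(by_sample)


print(multiplex_capacity(12, 8))  # 96
print(multiplex_capacity(48, 8))  # 384

# Made-up index sequences and reads from one pooled lane.
index_to_sample = {"ACGTAC": "sample_1", "GGATCC": "sample_2"}
pooled = [("ACGTAC", "TTGA"), ("GGATCC", "CCAT"), ("ACGTAC", "GGGA"), ("NNNNNN", "AAAA")]
print(demultiplex(pooled, index_to_sample))
```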

Any suitable method of sequencing nucleic acids can be used, non-limiting examples of which include Maxam & Gilbert, chain-termination methods, sequencing by synthesis, sequencing by ligation, sequencing by mass spectrometry, microscopy-based techniques, the like or combinations thereof. In some embodiments, a first generation technology, such as, for example, Sanger sequencing methods including automated Sanger sequencing methods, including microfluidic Sanger sequencing, can be used in a method provided herein. In some embodiments sequencing technologies that include the use of nucleic acid imaging technologies (e.g., transmission electron microscopy (TEM) and atomic force microscopy (AFM)), can be used. In some embodiments, a high-throughput sequencing method is used. High-throughput sequencing methods generally involve clonally amplified DNA templates or single DNA molecules that are sequenced in a massively parallel fashion, sometimes within a flow cell. Next generation (e.g., 2nd and 3rd generation) sequencing techniques capable of sequencing DNA in a massively parallel fashion can be used for methods described herein and are collectively referred to herein as “massively parallel sequencing” (MPS). In some embodiments MPS sequencing methods utilize a targeted approach, where specific chromosomes, genes or regions of interest are sequenced. In certain embodiments a non-targeted approach is used where most or all nucleic acids in a sample are sequenced, amplified and/or captured randomly.

In some embodiments a targeted enrichment, amplification and/or sequencing approach is used. A targeted approach often isolates, selects and/or enriches a subset of nucleic acids in a sample for further processing by use of sequence-specific oligonucleotides. In some embodiments a library of sequence-specific oligonucleotides is utilized to target (e.g., hybridize to) one or more sets of nucleic acids in a sample. Sequence-specific oligonucleotides and/or primers are often selective for particular sequences (e.g., unique nucleic acid sequences) present in one or more chromosomes, genes, exons, introns, and/or regulatory regions of interest. Any suitable method or combination of methods can be used for enrichment, amplification and/or sequencing of one or more subsets of targeted nucleic acids. In some embodiments targeted sequences are isolated and/or enriched by capture to a solid phase (e.g., a flow cell, a bead) using one or more sequence-specific anchors. In some embodiments targeted sequences are enriched and/or amplified by a polymerase-based method (e.g., a PCR-based method, by any suitable polymerase based extension) using sequence-specific primers and/or primer sets. Sequence specific anchors often can be used as sequence-specific primers.

MPS sequencing sometimes makes use of sequencing by synthesis and certain imaging processes. A nucleic acid sequencing technology that may be used in a method described herein is sequencing-by-synthesis and reversible terminator-based sequencing (e.g., Illumina's Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ 2500 (Illumina, San Diego Calif.)). With this technology, millions of nucleic acid (e.g., DNA) fragments can be sequenced in parallel. In one example of this type of sequencing technology, a flow cell is used which contains an optically transparent slide with 8 individual lanes on the surfaces of which are bound oligonucleotide anchors (e.g., adaptor primers). A flow cell often is a solid support that can be configured to retain and/or allow the orderly passage of reagent solutions over bound analytes. Flow cells frequently are planar in shape, optically transparent, generally in the millimeter or sub-millimeter scale, and often have channels or lanes in which the analyte/reagent interaction occurs.

Sequencing by synthesis, in some embodiments, comprises iteratively adding (e.g., by covalent addition) a nucleotide to a primer or preexisting nucleic acid strand in a template directed manner. Each iterative addition of a nucleotide is detected and the process is repeated multiple times until a sequence of a nucleic acid strand is obtained. The length of a sequence obtained depends, in part, on the number of addition and detection steps that are performed. In some embodiments of sequencing by synthesis, one, two, three or more nucleotides of the same type (e.g., A, G, C or T) are added and detected in a round of nucleotide addition. Nucleotides can be added by any suitable method (e.g., enzymatically or chemically). For example, in some embodiments a polymerase or a ligase adds a nucleotide to a primer or to a preexisting nucleic acid strand in a template directed manner. In some embodiments of sequencing by synthesis, different types of nucleotides, nucleotide analogues and/or identifiers are used. In some embodiments reversible terminators and/or removable (e.g., cleavable) identifiers are used. In some embodiments fluorescent labeled nucleotides and/or nucleotide analogues are used. In certain embodiments sequencing by synthesis comprises a cleavage (e.g., cleavage and removal of an identifier) and/or a washing step. In some embodiments the addition of one or more nucleotides is detected by a suitable method described herein or known in the art, non-limiting examples of which include any suitable imaging apparatus, a suitable camera, a digital camera, a CCD (Charge Coupled Device) based imaging apparatus (e.g., a CCD camera), a CMOS (Complementary Metal Oxide Semiconductor) based imaging apparatus (e.g., a CMOS camera), a photo diode (e.g., a photomultiplier tube), electron microscopy, a field-effect transistor (e.g., a DNA field-effect transistor), an ISFET ion sensor (e.g., a CHEMFET sensor), the like or combinations thereof.

Other sequencing methods that may be used to conduct methods herein include digital PCR and sequencing by hybridization. Digital polymerase chain reaction (digital PCR or dPCR) can be used to directly identify and quantify nucleic acids in a sample. Digital PCR can be performed in an emulsion, in some embodiments. For example, individual nucleic acids are separated, e.g., in a microfluidic chamber device, and each nucleic acid is individually amplified by PCR. Nucleic acids can be separated such that there is no more than one nucleic acid per well. In some embodiments, different probes can be used to distinguish various alleles (e.g., fetal alleles and maternal alleles). Alleles can be enumerated to determine copy number.

In certain embodiments, sequencing by hybridization can be used. The method involves contacting a plurality of polynucleotide sequences with a plurality of polynucleotide probes, where each of the plurality of polynucleotide probes can be optionally tethered to a substrate. The substrate can be a flat surface with an array of known nucleotide sequences, in some embodiments. The pattern of hybridization to the array can be used to determine the polynucleotide sequences present in the sample. In some embodiments, each probe is tethered to a bead, e.g., a magnetic bead or the like. Hybridization to the beads can be identified and used to identify the plurality of polynucleotide sequences within the sample.

In some embodiments, nanopore sequencing can be used in a method described herein. Nanopore sequencing is a single-molecule sequencing technology whereby a single nucleic acid molecule (e.g., DNA) is sequenced directly as it passes through a nanopore.

A suitable MPS method, system or technology platform for conducting methods described herein can be used to obtain nucleic acid sequencing reads. Non-limiting examples of MPS platforms include Illumina/Solexa/HiSeq (e.g., Illumina's Genome Analyzer; Genome Analyzer II; HISEQ 2000; HISEQ), SOLiD, Roche/454, PACBIO and/or SMRT, Helicos True Single Molecule Sequencing, Ion Torrent and Ion semiconductor-based sequencing (e.g., as developed by Life Technologies), WildFire, 5500, 5500xl W and/or 5500xl W Genetic Analyzer based technologies (e.g., as developed and sold by Life Technologies, US patent publication no. US20130012399); Polony sequencing, Pyrosequencing, Massively Parallel Signature Sequencing (MPSS), RNA polymerase (RNAP) sequencing, LaserGen systems and methods, Nanopore-based platforms, chemical-sensitive field effect transistor (CHEMFET) array, electron microscopy-based sequencing (e.g., as developed by ZS Genetics, Halcyon Molecular), and nanoball sequencing.

In some embodiments, chromosome-specific sequencing is performed. In some embodiments, chromosome-specific sequencing is performed utilizing DANSR (digital analysis of selected regions). Digital analysis of selected regions enables simultaneous quantification of hundreds of loci by cfDNA-dependent catenation of two locus-specific oligonucleotides via an intervening ‘bridge’ oligonucleotide to form a PCR template. In some embodiments, chromosome-specific sequencing is performed by generating a library enriched in chromosome-specific sequences. In some embodiments, sequence reads are obtained only for a selected set of chromosomes. In some embodiments, sequence reads are obtained only for chromosomes 21, 18 and 13.

Mapping Reads

Sequence reads can be mapped, and the number of reads mapping to a specified nucleic acid region (e.g., a chromosome, portion or segment thereof) is referred to as counts. Any suitable mapping method (e.g., process, algorithm, program, software, module, the like or combination thereof) can be used. Certain aspects of mapping processes are described hereafter.

Mapping nucleotide sequence reads (i.e., sequence information from a fragment whose physical genomic position is unknown) can be performed in a number of ways, and often comprises alignment of the obtained sequence reads with a matching sequence in a reference genome. In such alignments, sequence reads generally are aligned to a reference sequence and those that align are designated as being “mapped”, “a mapped sequence read” or “a mapped read”. In certain embodiments, a mapped sequence read is referred to as a “hit” or “count”. In some embodiments, mapped sequence reads are grouped together according to various parameters and assigned to particular portions, which are discussed in further detail below.
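Grouping mapped reads into portions and tallying counts can be sketched as below. The fixed-width portions and the 50,000-base default width are assumptions made only for illustration; portions may be defined according to other parameters. The function name and the input layout of (chromosome, position) pairs are hypothetical.

```python
from collections import Counter


def count_reads_per_portion(mapped_positions, portion_size=50_000):
    """Tally mapped reads per portion.

    `mapped_positions` is an iterable of (chromosome, 0-based position) pairs,
    one per mapped read. Each portion is keyed by (chromosome, portion index),
    where the index is the position divided by the assumed fixed portion width.
    """
    counts = Counter()
    for chrom, pos in mapped_positions:
        counts[(chrom, pos // portion_size)] += 1
    return counts


# Made-up mapped positions for four reads.
mapped = [("chr21", 10), ("chr21", 49_999), ("chr21", 50_000), ("chr18", 120_000)]
for portion, count in sorted(count_reads_per_portion(mapped).items()):
    print(portion, count)
```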

As used herein, the terms “aligned”, “alignment”, or “aligning” refer to two or more nucleic acid sequences that can be identified as a match (e.g., 100% identity) or partial match. Alignments can be done manually or by a computer (e.g., a software, program, module, or algorithm), non-limiting examples of which include the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. Alignment of a sequence read can be a 100% sequence match. In some cases, an alignment is less than a 100% sequence match (i.e., non-perfect match, partial match, partial alignment). In some embodiments an alignment is about a 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 89%, 88%, 87%, 86%, 85%, 84%, 83%, 82%, 81%, 80%, 79%, 78%, 77%, 76% or 75% match. In some embodiments, an alignment comprises a mismatch. In some embodiments, an alignment comprises 1, 2, 3, 4 or 5 mismatches. Two or more sequences can be aligned using either strand. In certain embodiments a nucleic acid sequence is aligned with the reverse complement of another nucleic acid sequence.
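The notions of percent match, mismatch count and alignment to either strand can be illustrated with a deliberately naive sketch. It is not ELAND or any other named aligner: it performs an ungapped, exhaustive scan, handles only the letters A, C, G and T, and all function names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence containing only A, C, G and T."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(read: str, ref_window: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(read) != len(ref_window):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(read, ref_window))


def percent_identity(read: str, ref_window: str) -> float:
    """Percentage of positions that match in an ungapped comparison."""
    return 100.0 * (len(read) - mismatches(read, ref_window)) / len(read)


def best_ungapped_hit(read: str, reference: str, max_mismatches: int = 2):
    """Return (position, strand, mismatches) of the best ungapped placement, or None.

    Both the read and its reverse complement are tried at every offset; the
    first placement with the fewest mismatches is kept.
    """
    best = None
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for pos in range(len(reference) - len(query) + 1):
            mm = mismatches(query, reference[pos:pos + len(query)])
            if mm <= max_mismatches and (best is None or mm < best[2]):
                best = (pos, strand, mm)
    return best


reference = "TTGACCATGGCATTAGC"  # made-up reference sequence
print(best_ungapped_hit("CCATGGCA", reference))      # (4, '+', 0)
print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))  # 90.0
```

Practical aligners use indexed data structures rather than an exhaustive scan; the scan is used here only to keep the definitions visible.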

Various computational methods can be used to map each sequence read to a portion. Non-limiting examples of computer algorithms that can be used to align sequences include BLAST, BLITZ, FASTA, BOWTIE 1, BOWTIE 2, ELAND, MAQ, PROBEMATCH, SOAP or SEQMAP, or variations thereof or combinations thereof. In some embodiments, sequence reads can be aligned with sequences in a reference genome. In some embodiments, the sequence reads can be found and/or aligned with sequences in nucleic acid databases known in the art including, for example, GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Databank of Japan). BLAST or similar tools can be used to search the identified sequences against a sequence database. Search hits can then be used to sort the identified sequences into appropriate portions (described hereafter), for example.

In some embodiments mapped sequence reads and/or information associated with a mapped sequence read are stored on and/or accessed from a non-transitory computer-readable storage medium in a suitable computer-readable format. A “computer-readable format” is sometimes referred to generally herein as a format. In some embodiments mapped sequence reads are stored and/or accessed in a suitable binary format, a text format, the like or a combination thereof. A binary format is sometimes a BAM format. A text format is sometimes a sequence alignment/map (SAM) format. Non-limiting examples of binary and/or text formats include BAM, SAM, SRF, FASTQ, Gzip, the like, or combinations thereof. In some embodiments mapped sequence reads are stored in and/or are converted to a format that requires less storage space (e.g., fewer bytes) than a traditional format (e.g., a SAM format or a BAM format). In some embodiments mapped sequence reads in a first format are compressed into a second format requiring less storage space than the first format. The term “compressed” as used herein refers to a process of data compression, source coding, and/or bit-rate reduction where a computer readable data file is reduced in size. In some embodiments mapped sequence reads are compressed from a SAM format into a binary format. Some data sometimes is lost after a file is compressed. Sometimes no data is lost in a compression process. In some file compression embodiments, some data is replaced with an index and/or a reference to another data file comprising information regarding a mapped sequence read. In some embodiments a mapped sequence read is stored in a binary format comprising or consisting of a read count, a chromosome identifier (e.g., that identifies a chromosome to which a read is mapped) and a chromosome position identifier (e.g., that identifies a position on a chromosome to which a read is mapped). In some embodiments a binary format comprises a 20 byte array, a 16 byte array, an 8 byte array, a 4 byte array or a 2 byte array. In some embodiments mapped read information is stored in an array in a 10 byte format, 9 byte format, 8 byte format, 7 byte format, 6 byte format, 5 byte format, 4 byte format, 3 byte format or 2 byte format. Sometimes mapped read data is stored in a 4 byte array comprising a 5 byte format. In some embodiments a binary format comprises a 5-byte format comprising a 1-byte chromosome ordinal and a 4-byte chromosome position. In some embodiments mapped reads are stored in a compressed binary format that is about 100 times, about 90 times, about 80 times, about 70 times, about 60 times, about 55 times, about 50 times, about 45 times, about 40 times or about 30 times smaller than a sequence alignment/map (SAM) format. In some embodiments mapped reads are stored in a compressed binary format that is about 2 times smaller to about 50 times smaller than (e.g., about 30, 25, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, or about 5 times smaller than) a GZip format.
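As a non-limiting illustration of the 5-byte format comprising a 1-byte chromosome ordinal and a 4-byte chromosome position, one possible encoding is sketched below in Java. The class name, method names and the big-endian byte order are assumptions made for this sketch; the actual layout of any binary format described herein may differ.

```java
import java.nio.ByteBuffer;

final class MappedReadRecord {
    static final int RECORD_BYTES = 5;

    /** Packs a chromosome ordinal (0-255) and a position into five bytes, big-endian. */
    static byte[] encode(int chromosomeOrdinal, int position) {
        if (chromosomeOrdinal < 0 || chromosomeOrdinal > 255) {
            throw new IllegalArgumentException("chromosome ordinal must fit in one byte");
        }
        return ByteBuffer.allocate(RECORD_BYTES)
                .put((byte) chromosomeOrdinal)   // byte 0: chromosome ordinal
                .putInt(position)                // bytes 1-4: chromosome position
                .array();
    }

    /** Reads the chromosome ordinal back as an unsigned value. */
    static int chromosome(byte[] record) {
        return record[0] & 0xFF;
    }

    /** Reads the 4-byte position stored after the chromosome ordinal. */
    static int position(byte[] record) {
        return ByteBuffer.wrap(record, 1, 4).getInt();
    }

    public static void main(String[] args) {
        byte[] record = encode(21, 33_000_123);
        System.out.println(record.length + " bytes: chr ordinal "
                + chromosome(record) + ", position " + position(record));
    }
}
```

A fixed-width record of this kind stores each mapped read in five bytes, whereas a SAM text line for the same read carries the read name, sequence and quality string as well.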

In some embodiments a system comprises a compression module. In some embodiments mapped sequence read information stored on a non-transitory computer-readable storage medium in a computer-readable format is compressed by a compression module. A compression module sometimes converts mapped sequence reads to and from a suitable format. A compression module can accept mapped sequence reads in a first format, convert them into a compressed format (e.g., a binary format 5) and transfer the compressed reads to another module (e.g., a bias density module 6) in some embodiments. A compression module often provides sequence reads in a binary format 5 (e.g., a BReads format). Non-limiting examples of a compression module include GZIP, BGZF, and BAM, the like or modifications thereof.

The following provides an example of converting an integer into a 4-byte array (most significant byte first) using Java:

  public static final byte[] convertToByteArray(int value) {
    // Unsigned right shifts select each byte, from most to least significant.
    return new byte[] {
      (byte) (value >>> 24),
      (byte) (value >>> 16),
      (byte) (value >>> 8),
      (byte) value
    };
  }

In some embodiments, a read may uniquely or non-uniquely map to portions in a reference genome. A read is considered as “uniquely mapped” if it aligns with a single sequence in the reference genome. A read is considered as “non-uniquely mapped” if it aligns with two or more sequences in the reference genome. In some embodiments, non-uniquely mapped reads are eliminated from further analysis (e.g., quantification). A certain, small degree of mismatch (0-1) may be allowed to account for single nucleotide polymorphisms that may exist between the reference genome and the reads from individual samples being mapped, in certain embodiments. In some embodiments, no degree of mismatch is allowed for a read mapped to a reference sequence.
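The distinction between uniquely and non-uniquely mapped reads can be illustrated, without limitation, by the following Java sketch, which scans a reference string by brute force and counts the positions at which a read aligns with at most a given number of mismatches. The names are hypothetical, only the forward strand is scanned, and a practical aligner would use an index rather than an exhaustive scan.

```java
final class UniqueMapping {
    /** Counts reference positions where the read aligns with at most maxMismatches. */
    static int countHits(String reference, String read, int maxMismatches) {
        int hits = 0;
        for (int start = 0; start + read.length() <= reference.length(); start++) {
            int mm = 0;
            // Stop comparing this window as soon as the mismatch budget is exceeded.
            for (int i = 0; i < read.length() && mm <= maxMismatches; i++) {
                if (reference.charAt(start + i) != read.charAt(i)) mm++;
            }
            if (mm <= maxMismatches) hits++;
        }
        return hits;
    }

    /** A read is treated as uniquely mapped when exactly one position qualifies. */
    static boolean isUniquelyMapped(String reference, String read, int maxMismatches) {
        return countHits(reference, read, maxMismatches) == 1;
    }

    public static void main(String[] args) {
        String reference = "ACGTACGTTTGA";
        System.out.println(isUniquelyMapped(reference, "TTGA", 0));
        System.out.println(isUniquelyMapped(reference, "ACGT", 0));
    }
}
```

Reads for which `isUniquelyMapped` returns false could then be excluded from quantification, as in the embodiments in which non-uniquely mapped reads are eliminated.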

As used herein, the term “reference genome” can refer to any particular known, sequenced or characterized genome, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject. For example, a reference genome used for human subjects as well as many other organisms can be found at the National Center for Biotechnology Information at World Wide Web URL ncbi.nlm.nih.gov. A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences. As used herein, a reference sequence or reference genome often is an assembled or partially assembled genomic sequence from an individual or multiple individuals. In some embodiments, a reference genome is an assembled or partially assembled genomic sequence from one or more human individuals. In some embodiments, a reference genome comprises sequences assigned to chromosomes.

In certain embodiments, where a sample nucleic acid is from a pregnant female, a reference sequence sometimes is not from the fetus, the mother of the fetus or the father of the fetus, and is referred to herein as an “external reference.” A maternal reference may be prepared and used in some embodiments. When a reference from the pregnant female is prepared (“maternal reference sequence”) based on an external reference, reads from DNA of the pregnant female that contains substantially no fetal DNA often are mapped to the external reference sequence and assembled. In certain embodiments the external reference is from DNA of an individual having substantially the same ethnicity as the pregnant female. A maternal reference sequence may not completely cover the maternal genomic DNA (e.g., it may cover about 50%, 60%, 70%, 80%, 90% or more of the maternal genomic DNA), and the maternal reference may not perfectly match the maternal genomic DNA sequence (e.g., the maternal reference sequence may include multiple mismatches).

In certain embodiments, mappability is assessed for a genomic region (e.g., portion, genomic portion). Mappability is the ability to unambiguously align a nucleotide sequence read to a portion of a reference genome, typically up to a specified number of mismatches, including, for example, 0, 1, 2 or more mismatches. For a given genomic region, the expected mappability can be estimated using a sliding-window approach of a preset read length and averaging the resulting read-level mappability values. Genomic regions comprising stretches of unique nucleotide sequence sometimes have a high mappability value.
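The sliding-window estimate of mappability described above can be sketched, in a non-limiting manner, as follows. In this Java sketch a window of the preset read length receives a read-level mappability of 1 when its sequence occurs exactly once in the reference and 0 otherwise, and the region value is the average over all windows lying within the region. The zero-mismatch criterion, the half-open coordinates and all names are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class Mappability {
    /**
     * Averages read-level mappability (1 if the window sequence is unique in the
     * reference, else 0) over all windows inside [regionStart, regionEnd).
     */
    static double regionMappability(String reference, int regionStart,
                                    int regionEnd, int readLength) {
        // Count how often every window-sized sequence occurs across the reference.
        Map<String, Integer> occurrences = new HashMap<>();
        for (int i = 0; i + readLength <= reference.length(); i++) {
            occurrences.merge(reference.substring(i, i + readLength), 1, Integer::sum);
        }
        int windows = 0;
        int unique = 0;
        for (int i = regionStart; i + readLength <= regionEnd; i++) {
            windows++;
            if (occurrences.get(reference.substring(i, i + readLength)) == 1) unique++;
        }
        return windows == 0 ? 0.0 : (double) unique / windows;
    }

    public static void main(String[] args) {
        String reference = "ACGTACGTTTGA";
        System.out.println(regionMappability(reference, 0, 8, 4));
        System.out.println(regionMappability(reference, 4, 12, 4));
    }
}
```

In the example, the repeated sequence ACGT lowers the value for the first region more than for the second, because two of the five windows in the first region are non-unique.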

Portions

In some embodiments, mapped sequence reads (i.e., sequence tags) are grouped together according to various parameters and assigned to particular portions (e.g., portions of a reference genome). Often, individual mapped sequence reads can be used to identify a portion (e.g., the presence, absence or amount of a portion) present in a sample. In some embodiments, the amount of a portion is indicative of the amount of a larger sequence (e.g., a chromosome) in the sample. The term “portion” can also be referred to herein as a “genomic section”, “bin”, “region”, “partition”, “portion of a reference genome”, “portion of a chromosome” or “genomic portion.” In some embodiments a portion is an entire chromosome, a segment of a chromosome, a segment of a reference genome, a segment spanning multiple chromosomes, multiple chromosome segments, and/or combinations thereof. In some embodiments, a portion is predefined based on specific parameters. In some embodiments, a portion is arbitrarily defined based on partitioning of a genome (e.g., partitioned by size, GC content, sequencing coverage variability, contiguous regions, contiguous regions of an arbitrarily defined size, and the like). Methods for partitioning a genome (e.g., a reference genome, or part thereof) are provided herein and described in further detail below.

In some embodiments, a portion is delineated based on one or more parameters which include, for example, length or a particular feature or features of the sequence. Portions can be selected, filtered and/or removed from consideration using any suitable criteria known in the art or described herein. In some embodiments, a portion is based on a particular length of genomic sequence. In some embodiments, a method can include analysis of multiple mapped sequence reads to a plurality of portions. Portions can be approximately the same length or portions can be different lengths. In some embodiments, portions are of about equal length. In some embodiments, portions are not of equal length. In some embodiments, portions are of a first equal length within a certain genomic region of interest, and are of a second equal length within a different genomic region of interest. For example, portions may be 30 kb in length within genomic region A and may be 70 kb in genomic region B. Methods for optimizing portion length within a genomic region of interest are provided herein and described in further detail below. In some embodiments portions of different lengths are adjusted or weighted. In some embodiments, a genome is partitioned according to an initial portion length and then re-partitioned according to one or more optimal portion lengths. In some embodiments, a portion is about 1 kilobase (kb) to about 1000 kb, about 1 kb to about 500 kb, about 10 kb to about 300 kb, about 10 kb to about 100 kb, about 20 kb to about 80 kb, about 30 kb to about 70 kb, about 40 kb to about 60 kb, and sometimes about 50 kb. In some embodiments, a portion is not 50 kb. In some embodiments, a portion is about 10 kb to about 20 kb. In some embodiments, a portion is about 10 kb. In some embodiments, a portion is about 20 kb. In some embodiments, a portion is about 30 kb. In some embodiments, a portion is about 40 kb. In some embodiments, a portion is about 50 kb. In some embodiments, a portion is about 60 kb. In some embodiments, a portion is about 70 kb. In some embodiments, a portion is about 80 kb. In some embodiments, a portion is about 90 kb. In some embodiments, a portion is about 100 kb. In some embodiments, a portion is about 30 kb to about 300 kb. In some embodiments, a portion is about 32 kb. In some embodiments, a portion is about 64 kb. In some embodiments, a portion is about 128 kb. In some embodiments, a portion is about 256 kb. A portion is not limited to contiguous runs of sequence. Thus, portions can be made up of contiguous and/or non-contiguous sequences. A portion is not limited to a single chromosome. In some embodiments, a portion includes all or part of one chromosome or all or part of two or more chromosomes. In some embodiments, portions may span one, two, or more entire chromosomes. In addition, portions may span jointed or disjointed regions of multiple chromosomes.
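For the simple case of contiguous portions of equal length on a single chromosome, the assignment of mapped reads to portions can be sketched, without limitation, as an integer division of each read position by the portion length. The Java names below are hypothetical, positions are assumed to be 0-based, and portions of unequal length or non-contiguous portions would require a different lookup.

```java
final class PortionCounts {
    /** Counts mapped read positions falling into each fixed-length portion. */
    static int[] countPerPortion(int[] readPositions, int chromosomeLength,
                                 int portionLength) {
        // Round up so that a final, shorter portion still receives its reads.
        int nPortions = (chromosomeLength + portionLength - 1) / portionLength;
        int[] counts = new int[nPortions];
        for (int pos : readPositions) {
            if (pos < 0 || pos >= chromosomeLength) continue; // ignore out-of-range positions
            counts[pos / portionLength]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] positions = {10, 49_999, 50_000, 120_000, 149_999};
        int[] counts = countPerPortion(positions, 150_000, 50_000);
        System.out.println(java.util.Arrays.toString(counts));
    }
}
```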

In some embodiments, portions can be particular chromosome segments in a chromosome of interest, such as, for example, a chromosome where a genetic variation is assessed (e.g., an aneuploidy of chromosomes 13, 18 and/or 21 or a sex chromosome). A portion can also be a pathogenic genome (e.g., bacterial, fungal or viral) or fragment thereof. Portions can be genes, gene fragments, regulatory sequences, introns, exons, and the like.

A “segment” of a chromosome generally is part of a chromosome, and typically is a different part of a chromosome than a portion. A segment of a chromosome sometimes is in a different region of a chromosome than a portion, sometimes does not share a polynucleotide with a portion, and sometimes includes a polynucleotide that is in a portion. A segment of a chromosome often contains a larger number of nucleotides than a portion (e.g., a segment sometimes includes a portion), and sometimes a segment of a chromosome contains a smaller number of nucleotides than a portion (e.g., a segment sometimes is within a portion). A “genomic region,” as used herein, often contains a larger number of nucleotides than a portion (e.g., a genomic region sometimes includes a portion or a plurality of portions).

Genome Partitioning

In some embodiments, a genome (e.g., human genome, reference genome, part of a reference genome, genomic region, one or more chromosomes, segment of a chromosome) is partitioned into portions based on information content of particular regions and/or other parameters. Genome partitioning sometimes is referred to as discretization, binning, segmenting, segmentation, portioning, dividing, grouping, aggregating, and aggregation. In some embodiments, a genome is partitioned according to guanine and cytosine (GC) content. In some embodiments, a genome is partitioned according to sequencing coverage variability. In some embodiments, partitioning a genome may eliminate or reduce biases associated with information content of particular regions and/or other parameters. In some embodiments, partitioning a genome may establish a fine grid (i.e., small portions) for certain regions and a coarse grid (i.e., large portions) for other regions. In some embodiments, partitioning a genome may eliminate similar regions (e.g., identical or homologous regions or sequences) across the genome and only keep unique regions. Regions removed during partitioning may be within a single chromosome or may span multiple chromosomes. In some embodiments a partitioned genome is trimmed down and optimized for faster alignment, often allowing for focus on uniquely identifiable sequences.

In some embodiments, partitioning of a genome into regions transcending chromosomes may be based on information gain produced in the context of classification. For example, information content may be quantified using a p-value profile measuring the significance of particular genomic locations for distinguishing between groups of confirmed normal and abnormal subjects (e.g., euploid and trisomy subjects, respectively). In some embodiments, partitioning of a genome into regions transcending chromosomes may be based on any other criterion, such as, for example, speed/convenience while aligning tags, GC content (e.g., high or low GC content), uniformity of GC content, other measures of sequence content (e.g., fraction of individual nucleotides, fraction of pyrimidines or purines, fraction of natural vs. non-natural nucleic acids, fraction of methylated nucleotides, and CpG content), sequencing coverage variability, methylation state, duplex melting temperature, amenability to sequencing or PCR, uncertainty value assigned to individual portions of a reference genome, and/or a targeted search for particular features. Provided herein, for example, are methods for partitioning a genome according to GC content. Also provided herein, for example, are methods for partitioning a genome according to sequencing coverage variability.

GC Partitioning

In some embodiments, a genome is partitioned according to guanine and cytosine (GC) content. GC partitioning sometimes is referred to herein as “wavelet binning.” Often, each chromosome, or a portion of each chromosome, is partitioned separately from other chromosomes (i.e., one at a time) in the reference genome. While the method described below is generally applied to a single chromosome, one or more or all chromosomes in a reference genome may be partitioned according to the following method.

In some embodiments, partitioning a genome according to GC content comprises generating a GC profile for a chromosome, or a segment of a chromosome. GC profiles can be generated by quantifying the GC content (i.e., number of guanine and cytosine bases) for a given length of genomic sequence (i.e., window) throughout a chromosome, or part thereof, in a reference genome. Windows generally are relatively short lengths of genomic sequence (e.g., 100 bases to 10 kilobases (kb)). GC content generally is determined for contiguous windows throughout a chromosome or segment thereof. In certain instances, windows are 1 kb. Thus, for example, a GC profile may be generated by quantifying the GC content per 1 kb contiguous window throughout a chromosome in a reference genome.
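Generation of a GC profile over contiguous windows can be sketched, in a non-limiting manner, as follows. This Java sketch reports GC content as a fraction of each window rather than as a raw base count, ignores any trailing partial window, and uses hypothetical names; a 1 kb window would correspond to a `windowLength` of 1000.

```java
final class GcProfile {
    /** Returns the G+C fraction of each complete, contiguous window of the sequence. */
    static double[] gcFractionPerWindow(String sequence, int windowLength) {
        int nWindows = sequence.length() / windowLength; // trailing partial window is dropped
        double[] profile = new double[nWindows];
        for (int w = 0; w < nWindows; w++) {
            int gc = 0;
            for (int i = w * windowLength; i < (w + 1) * windowLength; i++) {
                char base = Character.toUpperCase(sequence.charAt(i));
                if (base == 'G' || base == 'C') gc++;
            }
            profile[w] = (double) gc / windowLength;
        }
        return profile;
    }

    public static void main(String[] args) {
        double[] profile = gcFractionPerWindow("GGCCAATTGCAT", 4);
        System.out.println(java.util.Arrays.toString(profile));
    }
}
```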

In some embodiments, partitioning a genome according to GC content comprises segmenting. In some embodiments segmenting modifies and/or transforms a profile (e.g., a GC profile) thereby providing one or more decomposition renderings of a profile. A profile subjected to a segmenting process often is a profile of GC content in a reference genome or part thereof (e.g., autosomes and sex chromosomes). A decomposition rendering of a profile is often a transformation of a profile. A decomposition rendering of a profile is sometimes a transformation of a profile into a representation of a genome, chromosome or segment thereof.

In certain embodiments a segmenting process locates and identifies one or more GC content levels within a profile that are different (e.g., substantially or significantly different) than one or more other GC content levels within a profile. A GC content level identified in a profile according to a segmenting process that is different than another GC content level in the profile, and has edges that are different than another GC content level in the profile, is referred to herein as a wavelet, and more generally as a GC content level for a discrete segment. A segmenting process can generate, from a profile of GC content or GC content levels, a decomposition rendering in which one or more discrete segments or wavelets can be identified. A discrete segment generally is shorter than what is segmented (e.g., chromosome, chromosomes, autosomes).

In some embodiments segmenting locates and identifies edges of discrete segments and wavelets within a profile. In certain embodiments one or both edges of one or more discrete segments and wavelets are identified. For example, a segmentation process can identify the location (e.g., genomic coordinates, e.g., portion location) of the right and/or the left edges of a discrete segment or wavelet in a profile. A discrete segment or wavelet often comprises two edges. For example, a discrete segment or wavelet can include a left edge and a right edge. In some embodiments, depending upon the representation or view, a left edge can be a 5′-edge and a right edge can be a 3′-edge of a nucleic acid segment in a profile. In some embodiments a left edge can be a 3′-edge and a right edge can be a 5′-edge of a nucleic acid segment in a profile. Often the edges of a profile are known prior to segmentation and therefore, in some embodiments, the edges of a profile determine which edge of a level is a 5′-edge and which edge is a 3′-edge. In some embodiments one or both edges of a profile and/or discrete segment (e.g., wavelet) is an edge of a chromosome.

In some embodiments the edges of a discrete segment or wavelet are determined according to a decomposition rendering generated for a reference sample (e.g., a reference profile). In some embodiments a null edge height distribution is determined according to a decomposition rendering of a reference profile (e.g., a profile of a chromosome or segment thereof). In certain embodiments, the edges of a discrete segment or wavelet in a profile are identified when the level of the discrete segment or wavelet is outside a null edge height distribution. In some embodiments the edges of a discrete segment or wavelet in a profile are identified according to a Z-score calculated according to a decomposition rendering for a reference profile.

Sometimes segmenting generates two or more discrete segments or wavelets (e.g., two or more fragmented levels, two or more fragmented segments) in a profile. In some embodiments a decomposition rendering derived from a segmenting process is over-segmented or fragmented and comprises multiple discrete segments or wavelets. Sometimes discrete segments or wavelets generated by segmenting are substantially different and sometimes discrete segments or wavelets generated by segmenting are substantially similar. Substantially similar discrete segments or wavelets (e.g., substantially similar levels) often refers to two or more adjacent discrete segments or wavelets in a segmented profile each having a GC content level that differs by less than a predetermined level of uncertainty. In some embodiments substantially similar discrete segments or wavelets are adjacent to each other and are not separated by an intervening segment or wavelet. In some embodiments substantially similar discrete segments or wavelets are separated by one or more smaller segments or wavelets. Substantially different discrete segments or wavelets, in some embodiments, are not adjacent. Substantially different discrete segments or wavelets generally have substantially different GC content levels.

In some embodiments a segmentation process comprises determining (e.g., calculating) a GC content level (e.g., a quantitative value, e.g., a mean or median level), a level of uncertainty (e.g., an uncertainty value), Z-score, Z-value, p-value, the like or combinations thereof for one or more discrete segments or wavelets (e.g., GC content levels) in a profile or segment thereof. In some embodiments a GC content level (e.g., a quantitative value, e.g., a mean or median level), a level of uncertainty (e.g., an uncertainty value), Z-score, Z-value, p-value, the like or combinations thereof are determined (e.g., calculated) for a discrete segment or wavelet.

In some embodiments, segmenting is accomplished by a process that comprises one process or multiple sub-processes, non-limiting examples of which include a decomposition generating process (e.g., a wavelet decomposition generating process), thresholding, leveling, smoothing, the like or combination thereof. Thresholding, leveling, smoothing and the like can be performed in conjunction with a decomposition generating process, and are described hereafter with reference to a wavelet decomposition rendering process.

In some embodiments, segmenting is performed according to a wavelet decomposition generating process. In some embodiments segmenting is performed according to two or more wavelet decomposition generating processes. In some embodiments a wavelet decomposition generating process identifies one or more wavelets in a profile and provides a decomposition rendering of a profile.

Segmenting can be performed, in full or in part, by any suitable wavelet decomposition generating process described herein or known in the art. Non-limiting examples of a wavelet decomposition generating process include a Haar wavelet segmentation (Haar, Alfred (1910) “Zur Theorie der orthogonalen Funktionensysteme”, Mathematische Annalen 69 (3): 331-371; Nason, G. P. (2008) “Wavelet methods in Statistics”, R. Springer, New York) (e.g., WaveThresh), a suitable recursive binary segmentation process, circular binary segmentation (CBS) (Olshen, A B, Venkatraman, E S, Lucito, R, Wigler, M (2004) “Circular binary segmentation for the analysis of array-based DNA copy number data”, Biostatistics, 5, 4:557-72; Venkatraman, E S, Olshen, A B (2007) “A faster circular binary segmentation algorithm for the analysis of array CGH data”, Bioinformatics, 23, 6:657-63), Maximal Overlap Discrete Wavelet Transform (MODWT) (L. Hsu, S. Self, D. Grove, T. Randolph, K. Wang, J. Delrow, L. Loo, and P. Porter, “Denoising array-based comparative genomic hybridization data using wavelets”, Biostatistics (Oxford, England), vol. 6, no. 2, pp. 211-226, 2005), stationary wavelet (SWT) (Y. Wang and S. Wang, “A novel stationary wavelet denoising algorithm for array-based DNA copy number data”, International Journal of Bioinformatics Research and Applications, vol. 3, no. 2, pp. 206-222, 2007), dual-tree complex wavelet transform (DTCWT) (Nha, N., H. Heng, S. Oraintara and W. Yuhang (2007) “Denoising of Array-Based DNA Copy Number Data Using The Dual-tree Complex Wavelet Transform.” 137-144), convolution with edge detection kernel, Jensen Shannon Divergence, Kullback-Leibler divergence, Binary Recursive Segmentation, a Fourier transform, the like or combinations thereof.

A wavelet decomposition generating process can be represented or performed by a suitable software, module and/or code written in a suitable language (e.g., a computer programming language known in the art) and/or operating system, non-limiting examples of which include UNIX, Linux, Oracle, Windows, Ubuntu, ActionScript, C, C++, C#, Haskell, Java, JavaScript, Objective-C, Perl, Python, Ruby, Smalltalk, SQL, Visual Basic, COBOL, Fortran, UML, HTML (e.g., with PHP), PGP, G, R, S, the like or combinations thereof. In some embodiments a suitable wavelet decomposition generating process is represented in S or R code or by a package (e.g., an R package). R, R source code, R programs, R packages and R documentation for wavelet decomposition generating processes are available for download from a CRAN or CRAN mirror site (e.g., The Comprehensive R Archive Network (CRAN); World Wide Web URL cran.us.r-project.org). CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R. For example, WaveThresh (WaveThresh: Wavelets statistics and transforms; World Wide Web URL cran.r-project.org/web/packages/wavethresh/index.html) and a detailed description of WaveThresh (Package ‘wavethresh’; World Wide Web URL cran.r-project.org/web/packages/wavethresh/wavethresh.pdf) are available for download. An example of R code for a CBS method can be downloaded (e.g., DNAcopy; World Wide Web URL bioconductor.org/packages/2.12/bioc/html/DNAcopy.html or Package ‘DNAcopy’; World Wide Web URL bioconductor.org/packages/release/bioc/manuals/DNAcopy/man/DNAcopy.pdf).

In some embodiments a wavelet decomposition generating process (e.g., a Haar wavelet segmentation, e.g., WaveThresh) comprises thresholding. In some embodiments thresholding distinguishes signals from noise. In certain embodiments thresholding determines which wavelet coefficients (e.g., nodes) are indicative of signals and should be retained and which wavelet coefficients are indicative of a reflection of noise and should be removed. In some embodiments thresholding comprises one or more variable parameters where a user sets the value of the parameter. In some embodiments thresholding parameters (e.g., a thresholding parameter, a policy parameter) can describe or define the amount of segmentation utilized in a wavelet decomposition generating process. Any suitable parameter values can be used. In some embodiments a thresholding parameter is used. In some embodiments a thresholding parameter value is a soft thresholding. In certain embodiments a soft thresholding is utilized to remove small and non-significant coefficients. In certain embodiments a hard thresholding is utilized. In certain embodiments a thresholding comprises a policy parameter. Any suitable policy value can be used.

In some embodiments a policy used is “universal” and in some embodiments a policy used is “sure”.
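As a non-limiting illustration of hard and soft thresholding of wavelet coefficients, the following Java sketch applies an orthonormal Haar decomposition to a profile whose length is a power of two, thresholds the detail coefficients, and reconstructs the profile. It is not the WaveThresh implementation; the class and method names are hypothetical, and the threshold value is supplied by the caller rather than derived from a “universal” or “sure” policy.

```java
import java.util.Arrays;

final class HaarDenoise {
    private static final double SQRT2 = Math.sqrt(2.0);

    /** Full Haar decomposition; layout is [approximation, coarsest detail, ..., finest details]. */
    static double[] forward(double[] x) {
        double[] a = x.clone();
        for (int len = a.length; len > 1; len /= 2) {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++) {
                tmp[i] = (a[2 * i] + a[2 * i + 1]) / SQRT2;            // smooth part
                tmp[len / 2 + i] = (a[2 * i] - a[2 * i + 1]) / SQRT2;  // detail part
            }
            System.arraycopy(tmp, 0, a, 0, len);
        }
        return a;
    }

    /** Inverse of forward(): rebuilds the signal from Haar coefficients. */
    static double[] inverse(double[] coeffs) {
        double[] a = coeffs.clone();
        for (int len = 2; len <= a.length; len *= 2) {
            double[] tmp = new double[len];
            for (int i = 0; i < len / 2; i++) {
                tmp[2 * i] = (a[i] + a[len / 2 + i]) / SQRT2;
                tmp[2 * i + 1] = (a[i] - a[len / 2 + i]) / SQRT2;
            }
            System.arraycopy(tmp, 0, a, 0, len);
        }
        return a;
    }

    /** Hard thresholding: coefficients at or below the threshold are set to zero. */
    static double hard(double c, double t) {
        return Math.abs(c) > t ? c : 0.0;
    }

    /** Soft thresholding: coefficients are shrunk toward zero by the threshold. */
    static double soft(double c, double t) {
        return Math.signum(c) * Math.max(Math.abs(c) - t, 0.0);
    }

    /** Thresholds every detail coefficient (index 0, the approximation, is kept). */
    static double[] denoise(double[] x, double threshold, boolean useSoft) {
        double[] c = forward(x);
        for (int i = 1; i < c.length; i++) {
            c[i] = useSoft ? soft(c[i], threshold) : hard(c[i], threshold);
        }
        return inverse(c);
    }

    public static void main(String[] args) {
        double[] profile = {4.1, 3.9, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0};
        System.out.println(Arrays.toString(denoise(profile, 1.0, false)));
    }
}
```

In the example, hard thresholding removes the small fluctuation within the first level while retaining the large coefficient that encodes the step between the two levels, so the reconstructed profile is piecewise constant.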

In some embodiments, a wavelet decomposition generating process (e.g., a Haar wavelet segmentation, e.g., WaveThresh) comprises leveling. In some embodiments, after thresholding, some high level coefficients remain. These coefficients represent steep changes or large spikes in the original signal and, in certain embodiments, are removed by leveling. In some embodiments leveling includes assignment of a value to a parameter known as a decomposition level c. In certain embodiments an optimal decomposition level is determined according to one or more determined values, such as the length of the chromosome (e.g., length of profile), the desired wavelet length to detect, fetal fraction, sequence coverage (e.g., plex level) and the noise level of a normalized profile. For a given length of a segment of a genome, chromosome or profile (L_(chr)), the wavelet decomposition level c is sometimes related to the minimum wavelet length or minimum portion length L_(min) according to the equation L_(min) = L_(chr)/2^(c+1). In some embodiments, a decomposition level c is determined according to one of the following equations: c = log₂(L_(chr)/L_(min)); c = log₂(L_(chr)/L_(min)) + 1; c = log₂(L_(chr)/L_(min)) − 1. In some embodiments, a decomposition level c is about 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10. In some embodiments L_(min) is predetermined. In some embodiments L_(min) is predetermined according to a limit of detection (LoD) analysis, as described in Example 1. In some embodiments the amount of sequence coverage (e.g., plex level) and fetal fraction are inversely proportional to L_(min). For example, the minimum desired wavelet length (e.g., minimum portion length) decreases (i.e., resolution increases) as the amount of fetal fraction in a sample increases. In some embodiments the minimum desired wavelet length (e.g., minimum portion length) decreases (i.e., resolution increases) as the sequencing coverage increases. In some embodiments thresholding is performed prior to leveling and sometimes thresholding is performed after leveling.
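The relationship L_(min) = L_(chr)/2^(c+1) can be illustrated numerically with the following non-limiting Java sketch. Solving that relationship for c gives c = log₂(L_(chr)/L_(min)) − 1, which is the variant computed here; the rounding to the nearest integer and the names used are assumptions made for this sketch.

```java
final class DecompositionLevel {
    /** Minimum wavelet (portion) length implied by a decomposition level c. */
    static double minLength(double chrLength, int c) {
        return chrLength / Math.pow(2.0, c + 1);
    }

    /** Decomposition level obtained by solving L_min = L_chr / 2^(c+1) for c. */
    static int levelFor(double chrLength, double minLength) {
        double c = Math.log(chrLength / minLength) / Math.log(2.0) - 1.0;
        return (int) Math.round(c); // rounding to the nearest integer is an assumption
    }

    public static void main(String[] args) {
        double chrLength = 64_000_000.0; // hypothetical profile length in bases
        System.out.println(minLength(chrLength, 9));
        System.out.println(levelFor(chrLength, 62_500.0));
    }
}
```

For a hypothetical 64,000,000 base profile, a decomposition level of 9 corresponds to a minimum portion length of 62,500 bases, and each increment of c halves that length.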

In some embodiments, a decomposition rendering is polished thereby providing a polished decomposition rendering. In some embodiments a decomposition rendering is polished two or more times. In some embodiments a decomposition rendering is polished before and/or after one or more steps of a segmenting process. In some embodiments genome partitioning comprises two or more segmenting processes and each segmenting process comprises one or more polishing processes. A decomposition rendering can refer to a polished decomposition rendering or a decomposition rendering that is not polished.

Thus, in some embodiments a segmenting process comprises polishing. In some embodiments a polishing process identifies two or more substantially similar discrete segments or wavelets (e.g., in a decomposition rendering) and merges them into a single discrete segment or wavelet. In some embodiments a polishing process identifies two or more adjacent segments or wavelets that are substantially similar and merges them into a single level, segment or wavelet. Thus, in some embodiments a polishing process comprises a merging process. In certain embodiments adjacent fragmented discrete segments or wavelets are merged according to their GC content levels. In some embodiments merging two or more adjacent discrete segments or wavelets comprises calculating a median level for the two or more adjacent discrete segments or wavelets that are eventually merged. In some embodiments, two or more adjacent discrete segments or wavelets that are substantially similar are merged and thereby polished resulting in a single segment, wavelet or GC content level. In certain embodiments, two or more adjacent discrete segments or wavelets are merged by a process described by Willenbrock and Fridlyand (Willenbrock H, Fridlyand J. A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics (2005) Nov. 15; 21(22):4084-91). In some embodiments two or more adjacent discrete segments or wavelets are merged by a process known as GLAD and described in Hupe, P. et al. (2004) “Analysis of array CGH data: from signal ratio to gain and loss of DNA regions”, Bioinformatics, 20, 3413-3422.
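A simple merging step of the kind described above can be sketched, without limitation, as a single left-to-right pass that removes the edge between two adjacent segments whenever their median GC content levels differ by less than a tolerance. This Java sketch is a simplified stand-in and is not the Willenbrock and Fridlyand procedure or GLAD; the names, the fixed tolerance and the representation of segments by a sorted list of boundary indices are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SegmentPolisher {
    /** Median of profile values in the half-open index range [start, end). */
    static double median(double[] profile, int start, int end) {
        double[] s = Arrays.copyOfRange(profile, start, end);
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    /**
     * boundaries holds sorted indices b0 = 0 < b1 < ... < bk = profile.length,
     * so that segment j covers [b_j, b_(j+1)). Returns the boundaries that
     * remain after merging adjacent segments with similar median levels.
     */
    static List<Integer> polish(double[] profile, List<Integer> boundaries,
                                double tolerance) {
        List<Integer> kept = new ArrayList<>();
        kept.add(boundaries.get(0));
        for (int j = 1; j < boundaries.size() - 1; j++) {
            int leftStart = kept.get(kept.size() - 1); // start of the (possibly merged) left segment
            int edge = boundaries.get(j);
            int rightEnd = boundaries.get(j + 1);
            double left = median(profile, leftStart, edge);
            double right = median(profile, edge, rightEnd);
            if (Math.abs(left - right) >= tolerance) {
                kept.add(edge); // levels differ enough: keep this edge
            }
        }
        kept.add(boundaries.get(boundaries.size() - 1));
        return kept;
    }

    public static void main(String[] args) {
        double[] gc = {0.40, 0.41, 0.39, 0.40, 0.42, 0.41, 0.60, 0.61, 0.59};
        System.out.println(polish(gc, Arrays.asList(0, 3, 6, 9), 0.05));
    }
}
```

Because the left segment grows each time an edge is dropped, its median is recomputed over the merged range before the next comparison, which corresponds to calculating a median level for the segments being merged.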

In some embodiments, a segmenting process comprises a “sliding edges” or “sliding window” process. A suitable “sliding edges” process can be used directly or can be adapted for validating a discrete segment in a decomposition rendering. In some embodiments, a “sliding edges” process comprises segmenting a discrete segment into multiple subsets of portions. In some embodiments, the discrete segment is a set of portions for a whole chromosome or a segment of a chromosome.

In certain embodiments a “sliding edges” process comprises segmenting an identified discrete segment into multiple subsets of portions where each of the subsets of portions represents a discrete segment with similar, but different edges. In some embodiments the originally identified discrete segment is included in the analysis. For example, the originally identified discrete segment is included as one of the multiple subsets of portions. The subsets of portions can be determined by varying one or both edges of the originally identified discrete segment by any suitable method. In some embodiments the left edge can be changed, thereby generating discrete segments with different left edges. In some embodiments the right edge can be changed, thereby generating discrete segments with different right edges. In some embodiments both the right and left edges can be changed. In some embodiments the edges are changed by moving the edge by one or more adjacent portions of a reference genome to the left or to the right of the original edges.

In some embodiments either one or both edges are changed by 5 to 30 portions of a reference genome. In some embodiments an edge is moved by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30 portions of a reference genome in either direction. In some embodiments, regardless of the portion size, an edge is changed to generate an edge range of about 100,000 to about 2,000,000 base pairs, about 250,000 to about 1,500,000 base pairs, or about 500,000 to about 1,000,000 base pairs for either or both edges. In some embodiments, regardless of the portion size, an edge is changed to generate an edge range of about 500,000, 600,000, 700,000, 750,000, 800,000, 900,000, or about 1,000,000 base pairs for either or both edges.

In some embodiments an identified discrete segment comprises a first end and a second end and the segmenting comprises (i) removing one or more portions from the first end of the set of portions by recursive removal, thereby providing a subset of portions with each recursive removal, (ii) terminating the recursive removal in (i) after n repeats, thereby providing n+1 subsets of portions, where the set of portions is a subset, and where each subset comprises a different number of portions, a first subset end and a second subset end, (iii) removing one or more portions from the second subset end of each of the n+1 subsets of portions provided in (ii) by recursive removal; and (iv) terminating the recursive removal in (iii) after n repeats, thereby providing multiple subsets of portions. In some embodiments the number of multiple subsets equals (n+1)^2 subsets. In some embodiments n is equal to an integer between 5 and 30. In some embodiments n is equal to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29 or 30.
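Steps (i) to (iv) above can be sketched as a pair of nested loops over the number of removals at each end. This is an illustrative sketch only: the function name `sliding_edge_subsets` and the `step` parameter (the number of portions removed per repeat) are hypothetical, and a discrete segment is represented simply as an ordered list of portion identifiers.

```python
def sliding_edge_subsets(portions, n, step=1):
    """Generate the (n + 1)**2 subsets produced by a sliding edges process.

    `portions` is the ordered list of portions in the identified discrete
    segment. Up to `n` removals of `step` portions are applied at each end;
    zero removals at both ends reproduces the original segment.
    """
    subsets = []
    for i in range(n + 1):       # repeats of removal at the first end
        for j in range(n + 1):   # repeats of removal at the second end
            left = i * step
            right = len(portions) - j * step
            if left >= right:
                raise ValueError("segment is too short for the requested n and step")
            subsets.append(portions[left:right])
    return subsets


portions = list(range(20))
subsets = sliding_edge_subsets(portions, n=3)
```

With n equal to 3, the sketch yields 16 subsets, the first of which is the originally identified discrete segment.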

In certain embodiments of a sliding edges approach, a level of significance (e.g., a Z-score, a p-value) is determined for each of the subsets of portions of a reference genome and an average, mean or median level of significance is determined according to the level of significance determined for all of the subsets.

In some embodiments the level of significance is a Z-score or a p-value. In some embodiments a Z-score is calculated according to the following formula:

Z_i = (E_i − Med.E_(n)) / MAD

where E_i is a quantitative determination of the level of the discrete segment i, Med.E_(n) is the median level for all discrete segments generated by a sliding edges process, MAD is the median absolute deviation for Med.E_(n), and Z_i is the resulting Z-score for discrete segment i. In some embodiments MAD can be replaced by any suitable measure of uncertainty. In some embodiments E_i is any suitable measure of a level, non-limiting examples of which include a median level, average level, mean level, sum of the counts for the portions, or the like.
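A minimal sketch of the Z-score calculation follows. It assumes that MAD is the unscaled median of the absolute deviations of the segment levels from their median; the function name `sliding_edge_z_scores` is hypothetical, and the input is one level per discrete segment generated by the sliding edges process.

```python
from statistics import median


def sliding_edge_z_scores(levels):
    """Return one Z-score per segment level: (E_i - median) / MAD.

    MAD is computed here as the median absolute deviation about the median,
    without any scaling constant (an assumption of this sketch).
    """
    med = median(levels)
    mad = median([abs(level - med) for level in levels])
    if mad == 0:
        raise ValueError("MAD is zero; use another measure of uncertainty")
    return [(level - med) / mad for level in levels]


z_scores = sliding_edge_z_scores([1.0, 2.0, 3.0, 4.0, 100.0])
```

For these example levels the median is 3.0 and the MAD is 1.0, so the last segment receives a Z-score of 97.0 while the others fall between -2.0 and 1.0.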

In some embodiments, a segmenting process as described above is applied to a GC profile to provide discrete segments. A segmenting process can be performed on the GC content levels, as described herein. In certain instances, windows having similar GC content levels are merged into the discrete segments during the segmenting process. In certain embodiments, a segmenting process generates a decomposition rendering comprising the discrete segments. In certain embodiments, a chromosome is partitioned into a plurality of portions according to the discrete segments. Thus, in certain embodiments, the location and length of the discrete segments are the same or are similar to the location and length of the portions in a GC partitioned chromosome.

Sequencing Coverage Variability Partitioning

In some embodiments, a genome is partitioned according to sequencing coverage variability. In certain instances, sequencing of nucleic acid comprising a mixture of maternal and fetal genetic material (e.g., ccfDNA) can be characterized by variations in sequencing coverage as a function of genomic location. Without being limited by theory, certain genomic regions may provide a wealth of sequencing data while other regions may provide sparse sequencing data. Optimizing the portion length for certain regions in view of sequencing coverage variability may allow for use of a fine grid (i.e., small portions) for certain regions and a coarse grid (i.e., large portions) for other regions. A fine grid may be useful, for example, for detection of small genetic variations (e.g., small microdeletions or microduplications). A coarse grid may be useful, for example, for capturing sequence read data that otherwise may be filtered out using a smaller or standard grid.

In some embodiments, genome partitioning comprises determining sequencing coverage variability across a reference genome. In some embodiments, a training set of nucleotide sequence reads mapped to portions of a reference genome is used for determining sequencing coverage variability. A training set may include, for example, nucleotide sequence reads from a plurality of samples comprising a mixture of maternal and fetal ccf nucleic acid (e.g., ccfDNA). Sequencing coverage variability across a reference genome can be determined by the quantification of sequence reads for the training set. Quantification of sequence reads may include quantification of raw sequence reads and/or quantification of normalized sequence reads for one or more portions or regions as described herein. In certain instances, an average sequence read count is determined. In certain instances, an average normalized sequence read count is determined.

In some embodiments, genome partitioning comprises selecting an initial portion length. Initial portion length can be selected, for example, according to certain features of the training set. For example, initial portion length can be selected according to sequencing depth for the training set. In certain instances, initial portion length can be selected according to an average fetal fraction for the training set. In some embodiments, the initial portion length is selected according to sequencing depth and average fetal fraction for the training set. Average fetal fraction for the training set can be determined using any suitable method for determining fetal fraction known in the art or described herein (e.g., portion-specific fetal fraction determination). In some embodiments, average fetal fraction for the training set is known or can be calculated based on sample records. Initial portion length can be between about 1 kb to about 1000 kb. In some embodiments, initial portion length is about 10 kb. In some embodiments, initial portion length is about 20 kb. In some embodiments, initial portion length is about 30 kb. In some embodiments, initial portion length is about 40 kb. In some embodiments, initial portion length is about 50 kb. In some embodiments, initial portion length is about 60 kb. In some embodiments, initial portion length is about 70 kb. In some embodiments, initial portion length is about 80 kb. In some embodiments, initial portion length is about 90 kb. In some embodiments, initial portion length is about 100 kb. In some embodiments, initial portion length is not 50 kb. Generally, a larger initial portion length (e.g., larger than 50 kb) may be selected for a training set with a low average fetal fraction (e.g., less than 10%) and/or low sequencing depth, and a smaller initial portion length (e.g., shorter than 50 kb) may be selected for a training set with a high average fetal fraction (e.g., 10% to 20%) and/or high sequencing depth. In some embodiments, a total number of portions for a reference genome can be determined according to the initial portion length and total genome size.

In some embodiments, genome partitioning comprises partitioning at least two genomic regions according to initial portion size. A genomic region may be selected according to one or more known genetic variations (e.g., any form of copy number variation) that may exist within the region (e.g., microdeletions, microduplications, aneuploidies), or may be selected at random. A genomic region may be a chromosome or a segment of a chromosome. Generally, pairs of genomic regions are selected for a comparison of sequencing coverage variability, as described below. Additional pairings of genomic regions can be selected as needed. A first pair of genomic regions may comprise a first genomic region and a second genomic region. Often, the first genomic region and the second genomic region are substantially similar or equal in size (i.e., length). For example, a first genomic region and a second genomic region may differ in length by about 1 kb or less.

In some embodiments, genome partitioning comprises comparing sequencing coverage variability for each pair of genomic regions. Comparing sequencing coverage variability may comprise calculating a proportionality factor (P) according to the following equation:

P = (var₁/var₂)^(1/3)    (equation A)

where var₁ is the sequencing coverage variability of the first genomic region and var₂ is the sequencing coverage variability of the second genomic region. In some embodiments, sequencing coverage variability of a first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and sequencing coverage variability of a second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region. As used herein, a derivative of a sequence read count may be a processed sequence read count (e.g., filtered, adjusted, and/or normalized sequence read count, as described herein). In some embodiments, sequencing coverage variability of a first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and sequencing coverage variability of a second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region. In some embodiments, average nucleotide sequencing read counts for each genomic region are determined using the training set. In some embodiments, nucleotide sequence read count is a normalized nucleotide sequence read count. In some embodiments, average nucleotide sequence read count is an average normalized nucleotide sequence read count.
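Equation A can be illustrated with a short sketch. The text does not fix a particular statistic for sequencing coverage variability, so the sketch assumes the sample variance of per-portion (e.g., average normalized) read counts within each genomic region; the function name `proportionality_factor` and the example counts are hypothetical.

```python
from statistics import variance


def proportionality_factor(counts_region_1, counts_region_2):
    """Compute P = (var_1 / var_2) ** (1/3) for a pair of genomic regions.

    Each argument is a list of per-portion read counts for one region.
    Using the sample variance as the variability measure is an assumption.
    """
    var_1 = variance(counts_region_1)
    var_2 = variance(counts_region_2)
    if var_2 == 0:
        raise ValueError("second region has zero coverage variability")
    return (var_1 / var_2) ** (1.0 / 3.0)


region_1 = [0.0, 4.0, 0.0, 4.0]   # hypothetical per-portion counts
region_2 = [1.0, 2.0, 1.0, 2.0]
p = proportionality_factor(region_1, region_2)
```

Here the variance ratio is 16, so P is the cube root of 16, or roughly 2.52.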

In some embodiments, genome partitioning comprises recalculating the number of portions for a genomic region. Recalculating the number of portions for a genomic region typically is according to the comparison of sequence coverage variability described above. In certain instances, recalculating the number of portions for a genomic region is according to a proportionality factor (e.g., the proportionality factor described above). In some embodiments, recalculating the number of portions for a genomic region is performed according to a proportionality factor and a total number of portions (e.g., for a reference genome) determined from the initial portion size, as described above. For example, let N1 be the number of portions for a genomic region 1, let N2 be the number of portions for a genomic region 2, and let N3 be the number of portions for a genomic region 3. Let N be the total number of portions, and let the ratios between these numbers (derived from equation A above) be as follows:

N1/N2=P1

N1/N3=P2

Then,

N1+N2+N3=N

Given that N2=N1/P1 and N3=N1/P2, it follows that

N1=N*P1*P2/(P1*P2+P1+P2),

N2=N*P2/(P1*P2+P1+P2), and

N3=N*P1/(P1*P2+P1+P2).
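The expressions for N1, N2 and N3 are a special case of distributing N in proportion to 1, 1/P1 and 1/P2. The sketch below generalizes this to any number of regions, assuming each factor is the ratio of the number of portions in region 1 to the number in a later region; the function name `allocate_portions` is hypothetical, and rounding to whole portions is left to the caller.

```python
def allocate_portions(total_portions, factors):
    """Distribute `total_portions` across regions from proportionality factors.

    `factors[k]` is assumed to equal N1 / N(k+2), so region 1 has weight 1
    and each later region has weight 1 / factor. Returns [N1, N2, ...] as
    floats that sum to `total_portions`.
    """
    weights = [1.0] + [1.0 / p for p in factors]
    scale = total_portions / sum(weights)
    return [scale * w for w in weights]


# Worked example: N = 700, P1 = 2, P2 = 4
n1, n2, n3 = allocate_portions(700, [2.0, 4.0])
```

For this example the closed-form expressions give N1 = 700*8/14 = 400, N2 = 700*4/14 = 200 and N3 = 700*2/14 = 100, which match the values returned by the sketch.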

In some embodiments, genome partitioning comprises determining an optimized portion length for a genomic region according to the recalculated number of portions. An optimized portion length can be between about 1 kilobase (kb) to about 1000 kb. In some embodiments, an optimized portion length is between about 1 kb to about 500 kb, between about 10 kb to about 300 kb, between about 10 kb to about 100 kb, between about 20 kb to about 80 kb, between about 30 kb to about 70 kb, or between about 40 kb to about 60 kb. In some embodiments, an optimized portion length is not 50 kb. In some embodiments, an optimized portion length is about 10 kb to about 20 kb. In some embodiments, an optimized portion length is about 10 kb. In some embodiments, an optimized portion length is about 20 kb. In some embodiments, an optimized portion length is about 30 kb. In some embodiments, an optimized portion length is about 40 kb. In some embodiments, an optimized portion length is about 50 kb. In some embodiments, an optimized portion length is about 60 kb. In some embodiments, an optimized portion length is about 70 kb. In some embodiments, an optimized portion length is about 80 kb. In some embodiments, an optimized portion length is about 90 kb. In some embodiments, an optimized portion length is about 100 kb.

In some embodiments, genome partitioning comprises re-partitioning a genomic region into a plurality of portions according to an optimized portion size. In some embodiments, a plurality of portions comprises portions of constant (i.e., equal or substantially equal) length. In some embodiments, a plurality of portions comprises portions of varying size. In certain instances, a genome partitioning method may comprise additional methods of genome partitioning (e.g., GC partitioning described herein) in conjunction, which may provide portions of varying size. In some embodiments, genome partitioning comprises re-partitioning one or more additional genomic regions of a reference genome using a method described herein. In some embodiments, genome partitioning comprises re-partitioning all or substantially all of the genomic regions of a reference genome using a method described herein.
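Re-partitioning a region into portions of substantially equal length can be sketched as follows. The sketch assumes that the optimized portion length is the region length divided by the recalculated number of portions, rounded to a whole number of portions; the function name `repartition_region` and the half-open coordinate convention are hypothetical.

```python
def repartition_region(start, end, n_portions):
    """Split the half-open interval [start, end) into near-equal portions.

    `n_portions` may be fractional (e.g., a recalculated number of portions)
    and is rounded to the nearest whole number, with a minimum of one.
    Returns a list of (portion_start, portion_end) tuples.
    """
    n = max(1, round(n_portions))
    length = (end - start) / n  # optimized portion length under this assumption
    boundaries = [start + round(k * length) for k in range(n + 1)]
    return [(boundaries[k], boundaries[k + 1]) for k in range(n)]


portions = repartition_region(0, 1_000_000, 20.0)
```

For a 1,000,000 base pair region and 20 recalculated portions, each portion in this sketch is 50,000 base pairs long.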

In some embodiments, genome partitioning comprises estimating fetal fraction for a test sample. Fetal fraction can be estimated using any suitable method for estimating fetal fraction known in the art or described herein (e.g., portion-specific fetal fraction estimation, fetal quantifier assay, SNP-based fetal fraction estimation, Y chromosome fetal fraction estimation). Estimating fetal fraction sometimes comprises determining an error value. An error value may be expressed as (or represent), for example, an uncertainty value, a calculated variance, standard deviation, Z-score, p-value, mean absolute deviation, average absolute deviation, median absolute deviation, and the like. In some embodiments an error value defines a range above and below an estimated fetal fraction. In some embodiments, error is expressed as a range of values (e.g., confidence interval). In some embodiments, a region-specific fetal fraction is determined for a genomic region according to a correlation between nucleotide sequence read counts (e.g., raw sequence read counts, normalized sequence read counts) per portion and a weighting factor (e.g., portion-specific fetal fraction determination, as described herein).

In some embodiments, genome partitioning comprises determining a minimum genomic region size (i.e., length). In some embodiments, genome partitioning comprises determining a minimum genomic region size that is detectable for a sample having a given fetal fraction (e.g., as estimated according to a method described above). In certain instances, a minimum genomic region size is determined according to a limit of detection (LoD) analysis, as described in Example 1, and as provided in FIG. 7 for certain genetic abnormalities. In certain instances, a minimum genomic region size is determined according to a particular confidence interval of fetal fraction. For example, a minimum genomic region size can be determined according to an upper 80% confidence interval of fetal fraction. In certain embodiments, a minimum genomic region size can be determined according to an upper 90% confidence interval of fetal fraction. In certain embodiments, a minimum genomic region size can be determined according to an upper 95% confidence interval of fetal fraction. In certain embodiments, a minimum genomic region size can be determined according to an upper 99% confidence interval of fetal fraction.

In some embodiments, genome partitioning comprises determining a local minimum genomic region size (i.e., length). “Local” refers to within a particular genomic region being repartitioned. In some embodiments, genome partitioning comprises determining a local minimum genomic region size that is detectable for a sample having an average fetal fraction. An average fetal fraction may be between about 5% to about 20%. For example, an average fetal fraction may be about 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, 10%, 10.5%, 11%, 11.5%, 12%, 12.5%, 13%, 13.5%, 14%, 14.5%, 15%, 15.5%, 16%, 16.5%, 17%, 17.5%, 18%, 18.5%, 19%, or 19.5%. In certain instances, a local minimum genomic region size is determined according to a limit of detection (LoD) analysis, as described in Example 1, and as provided in FIG. 7 for certain genetic abnormalities.

In some embodiments, genome partitioning comprises determining a local minimum genomic region size (i.e., length). In some embodiments, genome partitioning comprises determining a local minimum genomic region size that is detectable for a sample having a given fetal fraction (e.g., as estimated according to a method described above). In certain instances, a local minimum genomic region size is determined according to a limit of detection (LoD) analysis, as described in Example 1, and as provided in FIG. 7 for certain genetic abnormalities. In certain instances, a local minimum genomic region size is determined according to a particular confidence interval of fetal fraction. For example, a local minimum genomic region size can be determined according to an upper 80% confidence interval of fetal fraction. In certain embodiments, a local minimum genomic region size can be determined according to an upper 90% confidence interval of fetal fraction. In certain embodiments, a local minimum genomic region size can be determined according to an upper 95% confidence interval of fetal fraction. In certain embodiments, a local minimum genomic region size can be determined according to an upper 99% confidence interval of fetal fraction.

In certain instances, a minimum genomic region size or a local minimum genomic region size may span a single portion. To address this possibility, genome partitioning may further comprise adjusting the number of portions for each genomic region such that each region comprises at least two portions. Adjusting the number of portions for each genomic region can generate a refined grid (i.e., refined repartitioned genome).

In some embodiments, genome partitioning comprises re-estimating fetal fraction from the refined re-partitioned genomic region. In some embodiments, a re-estimated fetal fraction is compared to an initial fetal fraction estimate for a sample. In some embodiments, a re-estimated fetal fraction is compared to a region-specific fetal fraction estimate. In some embodiments, certain method components are repeated when an initial estimated fetal fraction or region-specific fetal fraction differs from a re-estimated fetal fraction by a predetermined tolerance value. A predetermined tolerance value can be between about 1% to about 25%. For example, a predetermined tolerance value can be about 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, 20%, 21%, 22%, 23% or 24%.

Counts

Sequence reads that are mapped or partitioned based on a selected feature or variable can be quantified to determine the number of reads that are mapped to one or more portions (e.g., portion of a reference genome), in some embodiments. In certain embodiments the quantity of sequence reads that are mapped to a portion is termed counts (e.g., a count). Often a count is associated with a portion. In certain embodiments counts for two or more portions (e.g., a set of portions) are mathematically manipulated (e.g., averaged, added, normalized, the like or a combination thereof). In some embodiments a count is determined from some or all of the sequence reads mapped to (i.e., associated with) a portion. In certain embodiments, a count is determined from a pre-defined subset of mapped sequence reads. Pre-defined subsets of mapped sequence reads can be defined or selected utilizing any suitable feature or variable. In some embodiments, pre-defined subsets of mapped sequence reads can include from 1 to n sequence reads, where n represents a number equal to the sum of all sequence reads generated from a test subject or reference subject sample.
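Counting mapped reads per portion can be sketched for portions of varying length. The sketch assumes that each read is represented by a single mapped position on one chromosome and that portions are contiguous, half-open intervals defined by a sorted list of boundaries; the function name `count_reads_per_portion` and the example coordinates are hypothetical.

```python
from bisect import bisect_right


def count_reads_per_portion(read_positions, boundaries):
    """Count reads falling in each portion defined by sorted `boundaries`.

    Portion k spans [boundaries[k], boundaries[k + 1]). Reads mapped outside
    the first and last boundary are ignored.
    """
    counts = [0] * (len(boundaries) - 1)
    for position in read_positions:
        k = bisect_right(boundaries, position) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts


boundaries = [0, 10_000, 60_000, 100_000]   # three portions of varying length
reads = [5, 9_999, 10_000, 59_999, 60_000, 99_999, 100_000, 150_000]
counts = count_reads_per_portion(reads, boundaries)
```

The two reads at or beyond position 100,000 fall outside the last portion and are not counted, leaving two reads in each of the three portions.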

In certain embodiments a count is derived from sequence reads that are processed or manipulated by a suitable method, operation or mathematical process known in the art. A count (e.g., counts) can be determined by a suitable method, operation or mathematical process. In certain embodiments a count is derived from sequence reads associated with a portion where some or all of the sequence reads are weighted, removed, filtered, normalized, adjusted, averaged, derived as a mean, added, or subtracted or processed by a combination thereof. In some embodiments, a count is derived from raw sequence reads and/or filtered sequence reads. In certain embodiments a count value is determined by a mathematical process. In certain embodiments a count value is an average, mean or sum of sequence reads mapped to a portion. Often a count is a mean number of counts. In some embodiments, a count is associated with an uncertainty value.

In some embodiments, counts can be manipulated or transformed (e.g., normalized, combined, added, filtered, selected, averaged, derived as a mean, the like, or a combination thereof). In some embodiments, counts can be transformed to produce normalized counts. Counts can be processed (e.g., normalized) by a method known in the art and/or as described herein (e.g., portion-wise normalization, median count (median bin count, median portion count) normalization, normalization by GC content, linear and nonlinear least squares regression, LOESS (e.g., GC LOESS), LOWESS, ChAI, principal component normalization, RM, GCRM, cQn and/or combinations thereof). In certain embodiments, counts can be processed (e.g., normalized) by one or more of LOESS, median count (median bin count, median portion count) normalization, and principal component normalization. In certain embodiments, counts can be processed (e.g., normalized) by LOESS followed by median count (median bin count, median portion count) normalization. In certain embodiments, counts can be processed (e.g., normalized) by LOESS followed by median count (median bin count, median portion count) normalization followed by principal component normalization.

Counts (e.g., raw, filtered and/or normalized counts) can be processed and normalized to one or more levels. Levels and profiles are described in greater detail hereafter. In certain embodiments counts can be processed and/or normalized to a reference level. Reference levels are addressed later herein. Counts processed according to a level (e.g., processed counts) can be associated with an uncertainty value (e.g., a calculated variance, an error, standard deviation, Z-score, p-value, mean absolute deviation, etc.). In some embodiments an uncertainty value defines a range above and below a level. A value for deviation can be used in place of an uncertainty value, and non-limiting examples of measures of deviation include standard deviation, average absolute deviation, median absolute deviation, standard score (e.g., Z-score, normal score, standardized variable) and the like.

Counts are often obtained from a nucleic acid sample from a pregnant female bearing a fetus. Counts of nucleic acid sequence reads mapped to one or more portions often are counts representative of both the fetus and the mother of the fetus (e.g., a pregnant female subject). In certain embodiments some of the counts mapped to a portion are from a fetal genome and some of the counts mapped to the same portion are from a maternal genome.

Data Processing and Normalization

Mapped sequence reads that have been counted are referred to herein as raw data, since the data represents unmanipulated counts (e.g., raw counts). In some embodiments, sequence read data in a data set can be processed further (e.g., mathematically and/or statistically manipulated) and/or displayed to facilitate providing an outcome. In certain embodiments, data sets, including larger data sets, may benefit from pre-processing to facilitate further analysis. Pre-processing of data sets sometimes involves removal of redundant and/or uninformative portions or portions of a reference genome (e.g., portions of a reference genome with uninformative data, redundant mapped reads, portions with zero median counts, over represented or under represented sequences). Without being limited by theory, data processing and/or preprocessing may (i) remove noisy data, (ii) remove uninformative data, (iii) remove redundant data, (iv) reduce the complexity of larger data sets, and/or (v) facilitate transformation of the data from one form into one or more other forms. The terms “pre-processing” and “processing” when utilized with respect to data or data sets are collectively referred to herein as “processing”. Processing can render data more amenable to further analysis, and can generate an outcome in some embodiments. In some embodiments one or more or all processing methods (e.g., normalization methods, portion filtering, mapping, validation, the like or combinations thereof) are performed by a processor, a micro-processor, a computer, in conjunction with memory and/or by a microprocessor controlled apparatus.

The term “noisy data” as used herein refers to (a) data that has a significant variance between data points when analyzed or plotted, (b) data that has a significant standard deviation (e.g., greater than 3 standard deviations), (c) data that has a significant standard error of the mean, the like, and combinations of the foregoing. Noisy data sometimes occurs due to the quantity and/or quality of starting material (e.g., nucleic acid sample), and sometimes occurs as part of processes for preparing or replicating DNA used to generate sequence reads. In certain embodiments, noise results from certain sequences being over represented when prepared using PCR-based methods. Methods described herein can reduce or eliminate the contribution of noisy data, and therefore reduce the effect of noisy data on the provided outcome.

The terms “uninformative data”, “uninformative portions of a reference genome”, and “uninformative portions” as used herein refer to portions, or data derived therefrom, having a numerical value that is significantly different from a predetermined threshold value or falls outside a predetermined cutoff range of values. The terms “threshold” and “threshold value” herein refer to any number that is calculated using a qualifying data set and serves as a limit of diagnosis of a genetic variation (e.g., a copy number variation, an aneuploidy, a microduplication, a microdeletion, a chromosomal aberration, and the like). In certain embodiments a threshold is exceeded by results obtained by methods described herein and a subject is diagnosed with a genetic variation (e.g., trisomy 21). A threshold value or range of values often is calculated by mathematically and/or statistically manipulating sequence read data (e.g., from a reference and/or subject), in some embodiments, and in certain embodiments, sequence read data manipulated to generate a threshold value or range of values is sequence read data (e.g., from a reference and/or subject). In some embodiments, an uncertainty value is determined. An uncertainty value generally is a measure of variance or error and can be any suitable measure of variance or error. In some embodiments an uncertainty value is a standard deviation, standard error, calculated variance, p-value, or mean absolute deviation (MAD). In some embodiments an uncertainty value can be calculated according to a formula described herein.

Any suitable procedure can be utilized for processing data sets described herein. Non-limiting examples of procedures suitable for use for processing data sets include filtering, normalizing, weighting, monitoring peak heights, monitoring peak areas, monitoring peak edges, determining area ratios, mathematical processing of data, statistical processing of data, application of statistical algorithms, analysis with fixed variables, analysis with optimized variables, plotting data to identify patterns or trends for additional processing, the like and combinations of the foregoing. In some embodiments, data sets are processed based on various features (e.g., GC content, redundant mapped reads, centromere regions, telomere regions, the like and combinations thereof) and/or variables (e.g., fetal gender, maternal age, maternal ploidy, percent contribution of fetal nucleic acid, the like or combinations thereof). In certain embodiments, processing data sets as described herein can reduce the complexity and/or dimensionality of large and/or complex data sets. A non-limiting example of a complex data set includes sequence read data generated from one or more test subjects and a plurality of reference subjects of different ages and ethnic backgrounds. In some embodiments, data sets can include from thousands to millions of sequence reads for each test and/or reference subject.

Data processing can be performed in any number of steps, in certain embodiments. For example, data may be processed using only a single processing procedure in some embodiments, and in certain embodiments data may be processed using 1 or more, 5 or more, 10 or more or 20 or more processing steps (e.g., 1 or more processing steps, 2 or more processing steps, 3 or more processing steps, 4 or more processing steps, 5 or more processing steps, 6 or more processing steps, 7 or more processing steps, 8 or more processing steps, 9 or more processing steps, 10 or more processing steps, 11 or more processing steps, 12 or more processing steps, 13 or more processing steps, 14 or more processing steps, 15 or more processing steps, 16 or more processing steps, 17 or more processing steps, 18 or more processing steps, 19 or more processing steps, or 20 or more processing steps). In some embodiments, processing steps may be the same step repeated two or more times (e.g., filtering two or more times, normalizing two or more times), and in certain embodiments, processing steps may be two or more different processing steps (e.g., filtering, normalizing; normalizing, monitoring peak heights and edges; filtering, normalizing, normalizing to a reference, statistical manipulation to determine p-values, and the like), carried out simultaneously or sequentially. In some embodiments, any suitable number and/or combination of the same or different processing steps can be utilized to process sequence read data to facilitate providing an outcome. In certain embodiments, processing data sets by the criteria described herein may reduce the complexity and/or dimensionality of a data set.

In some embodiments, one or more processing steps can comprise one or more filtering steps. The term “filtering” as used herein refers to removing portions of a reference genome from consideration. Portions of a reference genome can be selected for removal based on any suitable criteria, including but not limited to redundant data (e.g., redundant or overlapping mapped reads), non-informative data (e.g., portions of a reference genome with zero median counts), portions of a reference genome with over represented or under represented sequences, noisy data, the like, or combinations of the foregoing. A filtering process often involves removing one or more portions of a reference genome from consideration and subtracting the counts in the one or more portions of a reference genome selected for removal from the counted or summed counts for the portions of a reference genome, chromosome or chromosomes, or genome under consideration. In some embodiments, portions of a reference genome can be removed successively (e.g., one at a time to allow evaluation of the effect of removal of each individual portion), and in certain embodiments all portions of a reference genome marked for removal can be removed at the same time. In some embodiments, portions of a reference genome characterized by a variance above or below a certain level are removed, which sometimes is referred to herein as filtering “noisy” portions of a reference genome. In certain embodiments, a filtering process comprises obtaining data points from a data set that deviate from the mean profile level of a portion, a chromosome, or segment of a chromosome by a predetermined multiple of the profile variance, and in certain embodiments, a filtering process comprises removing data points from a data set that do not deviate from the mean profile level of a portion, a chromosome or segment of a chromosome by a predetermined multiple of the profile variance. 
In some embodiments, a filtering process is utilized to reduce the number of candidate portions of a reference genome analyzed for the presence or absence of a genetic variation. Reducing the number of candidate portions of a reference genome analyzed for the presence or absence of a genetic variation (e.g., micro-deletion, micro-duplication) often reduces the complexity and/or dimensionality of a data set, and sometimes increases the speed of searching for and/or identifying genetic variations and/or genetic aberrations by two or more orders of magnitude.
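
The filtering described above can be illustrated with a minimal sketch. It assumes hypothetical inputs (a mapping from portion identifiers to counts across reference samples, and an illustrative variance cutoff `max_variance`); the function names and the cutoff are not taken from the present disclosure. Portions with a zero median count or a variance above the cutoff are removed, and the counts in the removed portions are subtracted from the summed counts of a sample.

```python
from statistics import median, pvariance


def select_portions(reference_counts, max_variance):
    """Return the set of portion ids retained after filtering.

    reference_counts: dict mapping portion id -> list of counts, one per
    reference sample (hypothetical input layout).
    max_variance: illustrative cutoff for "noisy" portions (an assumption).
    """
    kept = set()
    for portion_id, counts in reference_counts.items():
        if median(counts) == 0:
            continue  # non-informative portion (zero median count)
        if pvariance(counts) > max_variance:
            continue  # noisy portion (variance above the cutoff)
        kept.add(portion_id)
    return kept


def filtered_total(sample_counts, kept):
    """Subtract counts of removed portions from the summed counts of a sample."""
    total = sum(sample_counts.values())
    removed = sum(c for p, c in sample_counts.items() if p not in kept)
    return total - removed


reference = {
    "p1": [100, 102, 98],
    "p2": [0, 0, 1],       # zero median count
    "p3": [50, 150, 100],  # high variance
    "p4": [80, 82, 81],
}
kept = select_portions(reference, max_variance=50)
sample = {"p1": 101, "p2": 0, "p3": 120, "p4": 79}
total_after = filtered_total(sample, kept)
```

Removing all marked portions at once, as here, corresponds to one of the two options described; successive removal would call `filtered_total` after each individual removal.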

In some embodiments one or more processing steps can comprise one or more normalization steps. Normalization can be performed by a suitable method described herein or known in the art. In certain embodiments normalization comprises adjusting values measured on different scales to a notionally common scale. In certain embodiments normalization comprises a sophisticated mathematical adjustment to bring probability distributions of adjusted values into alignment. In some embodiments normalization comprises aligning distributions to a normal distribution. In certain embodiments normalization comprises mathematical adjustments that allow comparison of corresponding normalized values for different datasets in a way that eliminates the effects of certain gross influences (e.g., error and anomalies). In certain embodiments normalization comprises scaling. Normalization sometimes comprises division of one or more data sets by a predetermined variable or formula. Normalization sometimes comprises subtraction of a predetermined variable or formula from one or more data sets. Non-limiting examples of normalization methods include portion-wise normalization, normalization by GC content, median count (median bin count, median portion count) normalization, linear and nonlinear least squares regression, LOESS, GC LOESS, LOWESS (locally weighted scatterplot smoothing), ChAI, principal component normalization, repeat masking (RM), GC-normalization and repeat masking (GCRM), cQn and/or combinations thereof. 
In some embodiments, the determination of a presence or absence of a genetic variation (e.g., an aneuploidy, a microduplication, a microdeletion) utilizes a normalization method (e.g., portion-wise normalization, normalization by GC content, median count (median bin count, median portion count) normalization, linear and nonlinear least squares regression, LOESS, GC LOESS, LOWESS (locally weighted scatterplot smoothing), ChAI, principal component normalization, repeat masking (RM), GC-normalization and repeat masking (GCRM), cQn, a normalization method known in the art and/or a combination thereof). In some embodiments, the determination of a presence or absence of a genetic variation (e.g., an aneuploidy, a microduplication, a microdeletion) utilizes one or more of LOESS, median count (median bin count, median portion count) normalization, and principal component normalization. In some embodiments, the determination of a presence or absence of a genetic variation utilizes LOESS followed by median count (median bin count, median portion count) normalization. In some embodiments, the determination of a presence or absence of a genetic variation utilizes LOESS followed by median count (median bin count, median portion count) normalization followed by principal component normalization.

Any suitable number of normalizations can be used. In some embodiments, data sets can be normalized 1 or more, 5 or more, 10 or more or even 20 or more times. Data sets can be normalized to values (e.g., normalizing value) representative of any suitable feature or variable (e.g., sample data, reference data, or both). Non-limiting examples of types of data normalizations that can be used include normalizing raw count data for one or more selected test or reference portions to the total number of counts mapped to the chromosome or the entire genome on which the selected portion or sections are mapped; normalizing raw count data for one or more selected portions to a median reference count for one or more portions or the chromosome on which a selected portion or segments is mapped; normalizing raw count data to previously normalized data or derivatives thereof; and normalizing previously normalized data to one or more other predetermined normalization variables. Normalizing a data set sometimes has the effect of isolating statistical error, depending on the feature or property selected as the predetermined normalization variable. Normalizing a data set sometimes also allows comparison of data characteristics of data having different scales, by bringing the data to a common scale (e.g., predetermined normalization variable). In some embodiments, one or more normalizations to a statistically derived value can be utilized to minimize data differences and diminish the importance of outlying data. Normalizing portions, or portions of a reference genome, with respect to a normalizing value sometimes is referred to as “portion-wise normalization”.
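
Two of the normalization types listed above can be sketched as follows. The sketch assumes raw counts held in plain Python lists, one entry per portion, and the function names are hypothetical. The first function normalizes raw counts to the total number of counts, and the second normalizes raw counts to a per-portion median reference count (a form of portion-wise normalization).

```python
from statistics import median


def normalize_to_total(raw_counts):
    """Divide each portion count by the total counts for the chromosome or genome."""
    total = sum(raw_counts)
    return [c / total for c in raw_counts]


def normalize_to_reference_median(sample_counts, reference_counts):
    """Divide each portion count by the median reference count for that portion.

    reference_counts: list of per-sample count lists, all the same length as
    sample_counts (hypothetical input layout). Assumes no median is zero.
    """
    reference_medians = [median(column) for column in zip(*reference_counts)]
    return [s / m for s, m in zip(sample_counts, reference_medians)]


fractions = normalize_to_total([10, 30, 60])
relative = normalize_to_reference_median(
    [24, 10],
    [[10, 20], [12, 22], [14, 18]],
)
```

Both functions bring counts to a common scale, so values from samples sequenced at different depths can be compared.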

In certain embodiments, a processing step comprising normalization includes normalizing to a static window, and in some embodiments, a processing step comprising normalization includes normalizing to a moving or sliding window. The term “window” as used herein refers to one or more portions chosen for analysis, and sometimes used as a reference for comparison (e.g., used for normalization and/or other mathematical or statistical manipulation). The term “normalizing to a static window” as used herein refers to a normalization process using one or more portions selected for comparison between a test subject and reference subject data set. In some embodiments the selected portions are utilized to generate a profile. A static window generally includes a predetermined set of portions that do not change during manipulations and/or analysis. The terms “normalizing to a moving window” and “normalizing to a sliding window” as used herein refer to normalizations performed to portions localized to the genomic region (e.g., immediate genetic surrounding, adjacent portion or sections, and the like) of a selected test portion, where one or more selected test portions are normalized to portions immediately surrounding the selected test portion. In certain embodiments, the selected portions are utilized to generate a profile. A sliding or moving window normalization often includes repeatedly moving or sliding to an adjacent test portion, and normalizing the newly selected test portion to portions immediately surrounding or adjacent to the newly selected test portion, where adjacent windows have one or more portions in common. In certain embodiments, a plurality of selected test portions and/or chromosomes can be analyzed by a sliding window process.

In some embodiments, normalizing to a sliding or moving window can generate one or more values, where each value represents normalization to a different set of reference portions selected from different regions of a genome (e.g., chromosome). In certain embodiments, the one or more values generated are cumulative sums (e.g., a numerical estimate of the integral of the normalized count profile over the selected portion, domain (e.g., part of chromosome), or chromosome). The values generated by the sliding or moving window process can be used to generate a profile and facilitate arriving at an outcome. In some embodiments, cumulative sums of one or more portions can be displayed as a function of genomic position. Moving or sliding window analysis sometimes is used to analyze a genome for the presence or absence of micro-deletions and/or micro-insertions. In certain embodiments, displaying cumulative sums of one or more portions is used to identify the presence or absence of regions of genetic variation (e.g., micro-deletions, micro-duplications). In some embodiments, moving or sliding window analysis is used to identify genomic regions containing micro-deletions and in certain embodiments, moving or sliding window analysis is used to identify genomic regions containing micro-duplications.
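
A sliding window normalization followed by a cumulative sum can be sketched as below. Several choices here are assumptions made for illustration and are not specified in the present disclosure: the window is symmetric with a hypothetical `half_width`, each test portion is divided by the median of its neighbouring portions, and the cumulative sum is taken over deviations from an expected level of 1.0 so that an unaffected region stays near zero.

```python
from itertools import accumulate
from statistics import median


def sliding_window_normalize(counts, half_width):
    """Normalize each portion to the median of its immediately surrounding portions.

    Assumes at least two portions and a non-zero neighbour median.
    Adjacent windows share portions, as in a sliding window process.
    """
    normalized = []
    for i, count in enumerate(counts):
        lo = max(0, i - half_width)
        hi = min(len(counts), i + half_width + 1)
        neighbours = counts[lo:i] + counts[i + 1:hi]
        normalized.append(count / median(neighbours))
    return normalized


def cumulative_profile(normalized, expected=1.0):
    """Cumulative sum of deviations from the expected normalized level."""
    return list(accumulate(value - expected for value in normalized))


counts = [10, 10, 10, 5, 10, 10, 10]  # the fourth portion is under represented
normalized = sliding_window_normalize(counts, half_width=2)
profile = cumulative_profile(normalized)
```

In this toy input the cumulative profile steps down at the under represented portion and stays at the lower level afterwards, which is the kind of feature that can be displayed as a function of genomic position.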

Described in greater detail hereafter are certain examples of normalization processes that can be utilized, such as LOESS, ChAI and principal component normalization methods, for example.

In some embodiments, a processing step comprises a weighting. The terms “weighted”, “weighting” or “weight function” or grammatical derivatives or equivalents thereof, as used herein, refer to a mathematical manipulation of a portion or all of a data set sometimes utilized to alter the influence of certain data set features or variables with respect to other data set features or variables (e.g., increase or decrease the significance and/or contribution of data contained in one or more portions or portions of a reference genome, based on the quality or usefulness of the data in the selected portion or portions of a reference genome). A weighting function can be used to increase the influence of data with a relatively small measurement variance, and/or to decrease the influence of data with a relatively large measurement variance, in some embodiments. For example, portions of a reference genome with under represented or low quality sequence data can be “down weighted” to minimize the influence on a data set, whereas selected portions of a reference genome can be “up weighted” to increase the influence on a data set. A non-limiting example of a weighting function is [1/(standard deviation)]. A weighting step sometimes is performed in a manner substantially similar to a normalizing step. In some embodiments, a data set is divided by a predetermined variable (e.g., weighting variable). A predetermined variable (e.g., minimized target function, Phi) often is selected to weigh different parts of a data set differently (e.g., increase the influence of certain data types while decreasing the influence of other data types).
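
The [1/(standard deviation)] weighting function mentioned above can be sketched as follows. The input layout and function names are hypothetical, and the weighted mean is only one illustrative use of the weights.

```python
from statistics import stdev


def inverse_sd_weights(reference_counts):
    """Weight each portion by 1/(standard deviation) of its reference counts.

    reference_counts: dict mapping portion id -> list of counts across
    reference samples. Assumes every standard deviation is non-zero.
    """
    return {p: 1.0 / stdev(counts) for p, counts in reference_counts.items()}


def weighted_mean_level(sample_levels, weights):
    """Weighted mean of per-portion levels; low-variance portions count more."""
    numerator = sum(weights[p] * level for p, level in sample_levels.items())
    denominator = sum(weights[p] for p in sample_levels)
    return numerator / denominator


weights = inverse_sd_weights({"a": [1, 2, 3], "b": [0, 5, 10]})
level = weighted_mean_level({"a": 1.0, "b": 2.0}, weights)
```

Here portion "a" (standard deviation 1) receives five times the weight of portion "b" (standard deviation 5), so the weighted level is pulled toward the value observed in "a".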

In certain embodiments, a processing step can comprise one or more mathematical and/or statistical manipulations. Any suitable mathematical and/or statistical manipulation, alone or in combination, may be used to analyze and/or manipulate a data set described herein. Any suitable number of mathematical and/or statistical manipulations can be used. In some embodiments, a data set can be mathematically and/or statistically manipulated 1 or more, 5 or more, 10 or more or 20 or more times. Non-limiting examples of mathematical and statistical manipulations that can be used include addition, subtraction, multiplication, division, algebraic functions, least squares estimators, curve fitting, differential equations, rational polynomials, double polynomials, orthogonal polynomials, z-scores, p-values, chi values, phi values, analysis of peak levels, determination of peak edge locations, calculation of peak area ratios, analysis of median chromosomal level, calculation of mean absolute deviation, sum of squared residuals, mean, standard deviation, standard error, the like or combinations thereof. A mathematical and/or statistical manipulation can be performed on all or a portion of sequence read data, or processed products thereof. Non-limiting examples of data set variables or features that can be statistically manipulated include raw counts, filtered counts, normalized counts, peak heights, peak widths, peak areas, peak edges, lateral tolerances, P-values, median levels, mean levels, count distribution within a genomic region, relative representation of nucleic acid species, the like or combinations thereof.

In some embodiments, a processing step can comprise the use of one or more statistical algorithms. Any suitable statistical algorithm, alone or in combination, may be used to analyze and/or manipulate a data set described herein. Any suitable number of statistical algorithms can be used. In some embodiments, a data set can be analyzed using 1 or more, 5 or more, 10 or more or 20 or more statistical algorithms. Non-limiting examples of statistical algorithms suitable for use with methods described herein include decision trees, counternulls, multiple comparisons, omnibus test, Behrens-Fisher problem, bootstrapping, Fisher's method for combining independent tests of significance, null hypothesis, type I error, type II error, exact test, one-sample Z test, two-sample Z test, one-sample t-test, paired t-test, two-sample pooled t-test having equal variances, two-sample unpooled t-test having unequal variances, one-proportion z-test, two-proportion z-test pooled, two-proportion z-test unpooled, one-sample chi-square test, two-sample F test for equality of variances, confidence interval, credible interval, significance, meta analysis, simple linear regression, robust linear regression, the like or combinations of the foregoing. Non-limiting examples of data set variables or features that can be analyzed using statistical algorithms include raw counts, filtered counts, normalized counts, peak heights, peak widths, peak edges, lateral tolerances, P-values, median levels, mean levels, count distribution within a genomic region, relative representation of nucleic acid species, the like or combinations thereof.

In certain embodiments, a data set can be analyzed by utilizing multiple (e.g., 2 or more) statistical algorithms (e.g., least squares regression, principal component analysis, linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, logistic regression and/or loss smoothing) and/or mathematical and/or statistical manipulations (e.g., referred to herein as manipulations). The use of multiple manipulations can generate an N-dimensional space that can be used to provide an outcome, in some embodiments. In certain embodiments, analysis of a data set by utilizing multiple manipulations can reduce the complexity and/or dimensionality of the data set. For example, the use of multiple manipulations on a reference data set can generate an N-dimensional space (e.g., probability plot) that can be used to represent the presence or absence of a genetic variation, depending on the genetic status of the reference samples (e.g., positive or negative for a selected genetic variation). Analysis of test samples using a substantially similar set of manipulations can be used to generate an N-dimensional point for each of the test samples. The complexity and/or dimensionality of a test subject data set sometimes is reduced to a single value or N-dimensional point that can be readily compared to the N-dimensional space generated from the reference data. Test sample data that fall within the N-dimensional space populated by the reference subject data are indicative of a genetic status substantially similar to that of the reference subjects. Test sample data that fall outside of the N-dimensional space populated by the reference subject data are indicative of a genetic status substantially dissimilar to that of the reference subjects. In some embodiments, references are euploid or do not otherwise have a genetic variation or medical condition.

After data sets have been counted, optionally filtered and normalized, the processed data sets can be further manipulated by one or more filtering and/or normalizing procedures, in some embodiments. A data set that has been further manipulated by one or more filtering and/or normalizing procedures can be used to generate a profile, in certain embodiments. The one or more filtering and/or normalizing procedures sometimes can reduce data set complexity and/or dimensionality, in some embodiments. An outcome can be provided based on a data set of reduced complexity and/or dimensionality.

In some embodiments portions may be filtered according to a measure of error (e.g., standard deviation, standard error, calculated variance, p-value, mean absolute error (MAE), average absolute deviation and/or mean absolute deviation (MAD)). In certain embodiments a measure of error refers to count variability. In some embodiments portions are filtered according to count variability. In certain embodiments count variability is a measure of error determined for counts mapped to a portion of a reference genome for multiple samples (e.g., multiple samples obtained from multiple subjects, e.g., 50 or more, 100 or more, 500 or more, 1000 or more, 5000 or more or 10,000 or more subjects). In some embodiments portions with a count variability above a pre-determined upper range are filtered (e.g., excluded from consideration). In some embodiments a pre-determined upper range is a MAD value equal to or greater than about 50, about 52, about 54, about 56, about 58, about 60, about 62, about 64, about 66, about 68, about 70, about 72, about 74 or equal to or greater than about 76. In some embodiments portions with a count variability below a pre-determined lower range are filtered (e.g., excluded from consideration). In some embodiments a pre-determined lower range is a MAD value equal to or less than about 40, about 35, about 30, about 25, about 20, about 15, about 10, about 5, about 1, or equal to or less than about 0. In some embodiments portions with a count variability outside a pre-determined range are filtered (e.g., excluded from consideration). 
In some embodiments a pre-determined range is a MAD value greater than zero and less than about 76, less than about 74, less than about 73, less than about 72, less than about 71, less than about 70, less than about 69, less than about 68, less than about 67, less than about 66, less than about 65, less than about 64, less than about 62, less than about 60, less than about 58, less than about 56, less than about 54, less than about 52 or less than about 50. In some embodiments a pre-determined range is a MAD value greater than zero and less than about 67.7. In some embodiments portions with a count variability within a pre-determined range are selected (e.g., used for determining the presence or absence of a genetic variation).

In some embodiments the count variability of portions represents a distribution (e.g., a normal distribution). In some embodiments portions are selected within a quantile of the distribution. In some embodiments portions within a quantile equal to or less than about 99.9%, 99.8%, 99.7%, 99.6%, 99.5%, 99.4%, 99.3%, 99.2%, 99.1%, 99.0%, 98.9%, 98.8%, 98.7%, 98.6%, 98.5%, 98.4%, 98.3%, 98.2%, 98.1%, 98.0%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, or equal to or less than a quantile of about 75% for the distribution are selected. In some embodiments portions within a 99% quantile of the distribution of count variability are selected. In some embodiments portions with a MAD>0 and a MAD<67.725 are within the 99% quantile and are selected, resulting in the identification of a set of stable portions of a reference genome.
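
Selection of stable portions by count variability can be sketched as follows. The sketch computes MAD as the mean absolute deviation about the mean, following the expansion given above, and uses a simple nearest-rank rule for the quantile. The nearest-rank rule, the input layout and the function names are assumptions made for illustration.

```python
import math
from statistics import mean


def mean_abs_dev(values):
    """Mean absolute deviation of the values about their mean."""
    centre = mean(values)
    return mean([abs(v - centre) for v in values])


def select_stable_portions(counts_by_portion, quantile=0.99):
    """Keep portions whose MAD is greater than zero and within the given quantile.

    counts_by_portion: dict mapping portion id -> counts across multiple samples.
    The quantile cutoff uses a nearest-rank rule (an assumption of this sketch).
    """
    mads = {p: mean_abs_dev(c) for p, c in counts_by_portion.items()}
    ordered = sorted(mads.values())
    rank = max(1, math.ceil(quantile * len(ordered)))
    upper = ordered[rank - 1]
    return [p for p, m in mads.items() if 0 < m <= upper]


counts_by_portion = {
    "p1": [10, 10, 10],   # MAD of zero, removed
    "p2": [9, 11, 10],
    "p3": [8, 12, 10],
    "p4": [0, 100, 50],   # most variable portion
}
stable = select_stable_portions(counts_by_portion, quantile=0.75)
```

With a fixed cutoff instead of a quantile, the final comparison could be replaced by a test such as `0 < m < 67.725`, matching the range stated above.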

Portions may be filtered based on, or based in part on, a measure of error. A measure of error comprising absolute values of deviation, such as an R-factor, can be used for portion removal or weighting in certain embodiments. An R-factor, in some embodiments, is defined as the sum of the absolute deviations of the predicted count values from the actual measurements divided by the predicted count values from the actual measurements (e.g., Equation II herein). While a measure of error comprising absolute values of deviation may be used, a suitable measure of error may be alternatively employed. In certain embodiments, a measure of error not comprising absolute values of deviation, such as a dispersion based on squares, may be utilized. In some embodiments, portions are filtered or weighted according to a measure of mappability (e.g., a mappability score). A portion sometimes is filtered or weighted according to a relatively low number of sequence reads mapped to the portion (e.g., 0, 1, 2, 3, 4, 5 reads mapped to the portion). Portions can be filtered or weighted according to the type of analysis being performed. For example, for chromosome 13, 18 and/or 21 aneuploidy analysis, sex chromosomes may be filtered, and only autosomes, or a subset of autosomes, may be analyzed.

In particular embodiments, the following filtering process may be employed. The same set of portions (e.g., portions of a reference genome) within a given chromosome (e.g., chromosome 21) is selected and the number of reads in affected and unaffected samples are compared. The gap relates trisomy 21 and euploid samples and it involves a set of portions covering most of chromosome 21. The set of portions is the same between euploid and T21 samples. The distinction between a set of portions and a single section is not crucial, as a portion can be defined. The same genomic region is compared in different patients. This process can be utilized for a trisomy analysis, such as for T13 or T18 in addition to, or instead of, T21.

After data sets have been counted, optionally filtered and normalized, the processed data sets can be manipulated by weighting, in some embodiments. One or more portions can be selected for weighting to reduce the influence of data (e.g., noisy data, uninformative data) contained in the selected portions, in certain embodiments, and in some embodiments, one or more portions can be selected for weighting to enhance or augment the influence of data (e.g., data with small measured variance) contained in the selected portions. In some embodiments, a data set is weighted utilizing a single weighting function that decreases the influence of data with large variances and increases the influence of data with small variances. A weighting function sometimes is used to reduce the influence of data with large variances and augment the influence of data with small variances (e.g., [1/(standard deviation)]). In some embodiments, a profile plot of processed data further manipulated by weighting is generated to facilitate classification and/or providing an outcome. An outcome can be provided based on a profile plot of weighted data.

Filtering or weighting of portions can be performed at one or more suitable points in an analysis. For example, portions may be filtered or weighted before or after sequence reads are mapped to portions of a reference genome. Portions may be filtered or weighted before or after an experimental bias for individual genome portions is determined in some embodiments. In certain embodiments, portions may be filtered or weighted before or after genomic section levels are calculated.

After data sets have been counted, optionally filtered, normalized, and optionally weighted, the processed data sets can be manipulated by one or more mathematical and/or statistical (e.g., statistical functions or statistical algorithm) manipulations, in some embodiments. In certain embodiments, processed data sets can be further manipulated by calculating Z-scores for one or more selected portions, chromosomes, or portions of chromosomes. In some embodiments, processed data sets can be further manipulated by calculating P-values. In certain embodiments, mathematical and/or statistical manipulations include one or more assumptions pertaining to ploidy and/or fetal fraction. In some embodiments, a profile plot of processed data further manipulated by one or more statistical and/or mathematical manipulations is generated to facilitate classification and/or providing an outcome. An outcome can be provided based on a profile plot of statistically and/or mathematically manipulated data. An outcome provided based on a profile plot of statistically and/or mathematically manipulated data often includes one or more assumptions pertaining to ploidy and/or fetal fraction.
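
A Z-score and an accompanying P-value for a selected portion or chromosome can be sketched as follows. The sketch assumes that the test level is compared against levels from a set of reference samples and that the reference levels are approximately normally distributed. The one-sided upper-tail P-value is an illustrative choice, and the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def z_score(test_level, reference_levels):
    """Number of reference standard deviations between the test level and the reference mean."""
    return (test_level - mean(reference_levels)) / stdev(reference_levels)


def upper_tail_p_value(z):
    """One-sided P-value under a standard normal distribution."""
    return 1.0 - NormalDist().cdf(z)


z = z_score(106, [98, 100, 102])
p = upper_tail_p_value(z)
```

For these toy values the reference mean is 100 and the sample standard deviation is 2, so the test level of 106 gives a Z-score of 3.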

In certain embodiments, multiple manipulations are performed on processed data sets to generate an N-dimensional space and/or N-dimensional point, after data sets have been counted, optionally filtered and normalized. An outcome can be provided based on a profile plot of data sets analyzed in N-dimensions.

In some embodiments, data sets are processed utilizing one or more peak level analysis, peak width analysis, peak edge location analysis, peak lateral tolerances, the like, derivations thereof, or combinations of the foregoing, as part of or after data sets have been processed and/or manipulated. In some embodiments, a profile plot of data processed utilizing one or more peak level analysis, peak width analysis, peak edge location analysis, peak lateral tolerances, the like, derivations thereof, or combinations of the foregoing is generated to facilitate classification and/or providing an outcome. An outcome can be provided based on a profile plot of data that has been processed utilizing one or more peak level analysis, peak width analysis, peak edge location analysis, peak lateral tolerances, the like, derivations thereof, or combinations of the foregoing.

In some embodiments, the use of one or more reference samples that are substantially free of a genetic variation in question can be used to generate a reference median count profile, which may result in a predetermined value representative of the absence of the genetic variation, and often deviates from a predetermined value in areas corresponding to the genomic location in which the genetic variation is located in the test subject, if the test subject possessed the genetic variation. In test subjects at risk for, or suffering from a medical condition associated with a genetic variation, the numerical value for the selected portion or sections is expected to vary significantly from the predetermined value for non-affected genomic locations. In certain embodiments, the use of one or more reference samples known to carry the genetic variation in question can be used to generate a reference median count profile, which may result in a predetermined value representative of the presence of the genetic variation, and often deviates from a predetermined value in areas corresponding to the genomic location in which a test subject does not carry the genetic variation. In test subjects not at risk for, or suffering from a medical condition associated with a genetic variation, the numerical value for the selected portion or sections is expected to vary significantly from the predetermined value for affected genomic locations.

In some embodiments, analysis and processing of data can include the use of one or more assumptions. A suitable number or type of assumptions can be utilized to analyze or process a data set. Non-limiting examples of assumptions that can be used for data processing and/or analysis include maternal ploidy, fetal contribution, prevalence of certain sequences in a reference population, ethnic background, prevalence of a selected medical condition in related family members, parallelism between raw count profiles from different patients and/or runs after GC-normalization and repeat masking (e.g., GCRM), identical matches represent PCR artifacts (e.g., identical base position), assumptions inherent in a fetal quantifier assay (e.g., FQA), assumptions regarding twins (e.g., if 2 twins and only 1 is affected the effective fetal fraction is only 50% of the total measured fetal fraction (similarly for triplets, quadruplets and the like)), fetal cell free DNA (e.g., cfDNA) uniformly covers the entire genome, the like and combinations thereof.

In those instances where the quality and/or depth of mapped sequence reads does not permit an outcome prediction of the presence or absence of a genetic variation at a desired confidence level (e.g., 95% or higher confidence level), based on the normalized count profiles, one or more additional mathematical manipulation algorithms and/or statistical prediction algorithms, can be utilized to generate additional numerical values useful for data analysis and/or providing an outcome. The term “normalized count profile” as used herein refers to a profile generated using normalized counts. Examples of methods that can be used to generate normalized counts and normalized count profiles are described herein. As noted, mapped sequence reads that have been counted can be normalized with respect to test sample counts or reference sample counts. In some embodiments, a normalized count profile can be presented as a plot.

LOESS Normalization

LOESS is a regression modeling method known in the art that combines multiple regression models in a k-nearest-neighbor-based meta-model. LOESS is sometimes referred to as a locally weighted polynomial regression. GC LOESS, in some embodiments, applies a LOESS model to the relationship between fragment count (e.g., sequence reads, counts) and GC composition for portions of a reference genome. Plotting a smooth curve through a set of data points using LOESS is sometimes called a LOESS curve, particularly when each smoothed value is given by a weighted quadratic least squares regression over the span of values of the y-axis scattergram criterion variable. For each point in a data set, the LOESS method fits a low-degree polynomial to a subset of the data, with explanatory variable values near the point whose response is being estimated. The polynomial is fitted using weighted least squares, giving more weight to points near the point whose response is being estimated and less weight to points further away. The value of the regression function for a point is then obtained by evaluating the local polynomial using the explanatory variable values for that data point. The LOESS fit is sometimes considered complete after regression function values have been computed for each of the data points. Many of the details of this method, such as the degree of the polynomial model and the weights, are flexible.
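
The local fitting procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: it uses a degree-1 (local linear) polynomial, tricube weights and a hypothetical `span` parameter giving the fraction of points in each neighbourhood. The final GC correction step, which divides each count by its fitted value and rescales to the median count, is also an assumption of the sketch.

```python
from statistics import median


def loess_fit(x, y, span=0.5):
    """Return the locally weighted linear fit evaluated at each x value."""
    n = len(x)
    k = max(2, min(n, int(span * n + 0.5)))  # points per neighbourhood
    fitted = []
    for x0 in x:
        distances = sorted(abs(xi - x0) for xi in x)
        h = distances[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights: more weight near x0, zero weight at or beyond h.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # evaluate the local line at x0
    return fitted


def gc_loess_correct(gc_content, counts, span=0.5):
    """Divide each count by its LOESS-fitted value and rescale to the median count."""
    fitted = loess_fit(gc_content, counts, span)
    centre = median(counts)
    return [c * centre / f for c, f in zip(counts, fitted)]


gc = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60]
counts = [100 + 200 * g for g in gc]  # toy counts with a purely linear GC bias
corrected = gc_loess_correct(gc, counts, span=0.7)
```

In this toy input the counts depend only on GC content, so the corrected counts are all approximately equal to the median count, showing that the GC trend has been removed.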

ChAI Normalization

Another normalization methodology that can be used to reduce error associated with nucleic acid indicators is referred to herein as ChAI and often makes use of a principal component analysis. In certain embodiments, a principal component analysis includes (a) filtering, according to a read density distribution, portions of a reference genome, thereby providing a read density profile for a test sample comprising read densities of filtered portions, where the read densities comprise sequence reads of circulating cell-free nucleic acid from a test sample from a pregnant female, and the read density distribution is determined for read densities of portions for multiple samples, (b) adjusting the read density profile for the test sample according to one or more principal components, which principal components are obtained from a set of known euploid samples by a principal component analysis, thereby providing a test sample profile comprising adjusted read densities, and (c) comparing the test sample profile to a reference profile, thereby providing a comparison. In some embodiments, a principal component analysis includes (d) determining the presence or absence of a genetic variation for the test sample according to the comparison.
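
Step (b) above can be illustrated with a minimal sketch. The passage does not specify how the adjustment is computed, so the approach here is an assumption: the first principal component is estimated from euploid training profiles by power iteration, and the projection of the centered test profile onto that component is subtracted. The function names and input layout are hypothetical, and only one component is handled.

```python
import math


def first_principal_component(profiles, iterations=200):
    """Estimate per-portion means and the first principal component.

    profiles: list of equal-length read density profiles from euploid samples.
    Uses power iteration; the start vector [1, 2, ..., m] is arbitrary and could,
    in contrived cases, be orthogonal to the component being sought.
    """
    n, m = len(profiles), len(profiles[0])
    means = [sum(p[j] for p in profiles) / n for j in range(m)]
    centered = [[p[j] - means[j] for j in range(m)] for p in profiles]
    start_norm = math.sqrt(sum((j + 1) ** 2 for j in range(m)))
    v = [(j + 1) / start_norm for j in range(m)]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix explicitly.
        scores = [sum(row[j] * v[j] for j in range(m)) for row in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return means, v


def adjust_profile(test_profile, means, component):
    """Center the test profile and remove its projection onto the component."""
    centered = [t - mu for t, mu in zip(test_profile, means)]
    score = sum(c * v for c, v in zip(centered, component))
    return [c - score * v for c, v in zip(centered, component)]


# Toy training set whose only variation lies along the direction [1, -1, 0].
training = [[10 + a, 10 - a, 10] for a in (-2, -1, 0, 1, 2)]
means, component = first_principal_component(training)
adjusted = adjust_profile([13, 7, 10.5], means, component)
```

In this toy input the systematic variation along the first two portions is removed, and the adjusted profile retains only the deviation in the third portion, which can then be compared with a reference profile as in step (c).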

Filtering Portions

In certain embodiments one or more portions (e.g., portions of a genome)are removed from consideration by a filtering process. In certainembodiments one or more portions are filtered (e.g., subjected to afiltering process) thereby providing filtered portions. In someembodiments a filtering process removes certain portions and retainsportions (e.g., a subset of portions). Following a filtering process,retained portions are often referred to herein as filtered portions. Insome embodiments portions of a reference genome are filtered. In someembodiments portions of a reference genome that are removed by afiltering process are not included in a determination of the presence orabsence of a genetic variation (e.g., a chromosome aneuploidy,microduplication, microdeletion). In some embodiments portionsassociated with read densities (e.g., where a read density is for aportion) are removed by a filtering process and read densitiesassociated with removed portions are not included in a determination ofthe presence or absence of a genetic variation (e.g., a chromosomeaneuploidy, microduplication, microdeletion). In some embodiments a readdensity profile comprises and/or consist of read densities of filteredportions. Portions can be selected, filtered, and/or removed fromconsideration using any suitable criteria and/or method known in the artor described herein. Non-limiting examples of criteria used forfiltering portions include redundant data (e.g., redundant oroverlapping mapped reads), non-informative data (e.g., portions of areference genome with zero mapped counts), portions of a referencegenome with over represented or under represented sequences, GC content,noisy data, mappability, counts, count variability, read density,variability of read density, a measure of uncertainty, a repeatabilitymeasure, the like, or combinations of the foregoing. Portions aresometimes filtered according to a distribution of counts and/or adistribution of read densities. 
In some embodiments portions arefiltered according to a distribution of counts and/or read densitieswhere the counts and/or read densities are obtained from one or morereference samples. One or more reference samples is sometimes referredto herein as a training set. In some embodiments portions are filteredaccording to a distribution of counts and/or read densities where thecounts and/or read densities are obtained from one or more test samples.In some embodiments portions are filtered according to a measure ofuncertainty for a read density distribution. In certain embodiments,portions that demonstrate a large deviation in read densities areremoved by a filtering process. For example, a distribution of readdensities (e.g., a distribution of average mean, or median readdensities) can be determined, where each read density in thedistribution maps to the same portion. A measure of uncertainty (e.g., aMAD) can be determined by comparing a distribution of read densities formultiple samples where each portion of a genome is associated withmeasure of uncertainty. According to the foregoing example, portions canbe filtered according to a measure of uncertainty (e.g., a standarddeviation (SD), a MAD) associated with each portion and a predeterminedthreshold. A predetermined threshold is indicated by the dashed verticallines enclosing a range of acceptable MAD values. In certain instances,portions comprising MAD values within the acceptable range are retainedand portions comprising MAD values outside of the acceptable range areremoved from consideration by a filtering process. In some embodiments,according to the foregoing example, portions comprising read densitiesvalues (e.g., median, average or mean read densities) outside apre-determined measure of uncertainty are often removed fromconsideration by a filtering process. 
In some embodiments portions comprising read density values (e.g., median, average or mean read densities) outside an inter-quartile range of a distribution are removed from consideration by a filtering process. In some embodiments portions comprising read density values outside more than 2 times, 3 times, 4 times or 5 times an inter-quartile range of a distribution are removed from consideration by a filtering process. In some embodiments portions comprising read density values outside more than 2 sigma, 3 sigma, 4 sigma, 5 sigma, 6 sigma, 7 sigma or 8 sigma (e.g., where sigma is a range defined by a standard deviation) are removed from consideration by a filtering process.
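
Filtering portions by a measure of uncertainty can be sketched as follows. This is an illustrative example only: the dictionary layout, the function names mad and filter_portions, and the threshold values in the usage note are assumptions, not values taken from the text.

```python
import statistics

def mad(values):
    """Median absolute deviation of a list of read densities."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

def filter_portions(densities, low, high):
    """Retain portions whose MAD across reference samples lies in [low, high].

    densities: dict mapping a portion id to the read densities observed for
    that portion in multiple samples (e.g., a training set).
    Returns a dict of retained portion ids and their MAD values.
    """
    retained = {}
    for portion_id, values in densities.items():
        uncertainty = mad(values)
        if low <= uncertainty <= high:
            retained[portion_id] = uncertainty
    return retained
```

For example, with an assumed acceptable range of 0.01 to 0.5, a portion whose densities are [1.0, 1.1, 0.9, 1.0] would be retained, while a portion with densities [0.2, 2.5, 1.0, 4.0] (MAD 1.15) or a portion with identical densities in every sample (MAD 0) would be removed.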

In some embodiments a system comprises a filtering module. A filtering module often accepts, retrieves and/or stores portions (e.g., portions of pre-determined sizes and/or overlap, portion locations within a reference genome) and read densities associated with portions, often from another suitable module. In some embodiments selected portions (e.g., filtered portions) are provided by a filtering module. In some embodiments, a filtering module is required to provide filtered portions and/or to remove portions from consideration. In certain embodiments a filtering module removes read densities from consideration where read densities are associated with removed portions. A filtering module often provides selected portions (e.g., filtered portions) to another suitable module.

Bias Estimates

Sequencing technologies can be vulnerable to multiple sources of bias. Sometimes sequencing bias is a local bias (e.g., a local genome bias). Local bias often is manifested at the level of a sequence read. A local genome bias can be any suitable local bias. Non-limiting examples of a local bias include sequence bias (e.g., GC bias, AT bias, and the like), bias correlated with DNase I sensitivity, entropy, repetitive sequence bias, chromatin structure bias, polymerase error-rate bias, palindrome bias, inverted repeat bias, PCR related bias, the like or combinations thereof. In some embodiments the source of a local bias is not determined or known.

In some embodiments a local genome bias estimate is determined. A local genome bias estimate is sometimes referred to herein as a local genome bias estimation. A local genome bias estimate can be determined for a reference genome, a segment or a portion thereof. In some embodiments a local genome bias estimate is determined for one or more sequence reads (e.g., some or all sequence reads of a sample). A local genome bias estimate is often determined for a sequence read according to a local genome bias estimation for a corresponding location and/or position of a reference (e.g., a reference genome). In some embodiments a local genome bias estimate comprises a quantitative measure of bias of a sequence (e.g., a sequence read, a sequence of a reference genome). A local genome bias estimation can be determined by a suitable method or mathematical process. In some embodiments a local genome bias estimate is determined by a suitable distribution and/or a suitable distribution function (e.g., a PDF). In some embodiments a local genome bias estimate comprises a quantitative representation of a PDF. In some embodiments a local genome bias estimate (e.g., a probability density estimation (PDE), a kernel density estimation) is determined by a probability density function (e.g., a PDF, e.g., a kernel density function) of a local bias content. In some embodiments a density estimation comprises a kernel density estimation. A local genome bias estimate is sometimes expressed as an average, mean, or median of a distribution. Sometimes a local genome bias estimate is expressed as a sum or an integral (e.g., an area under a curve (AUC)) of a suitable distribution.

A PDF (e.g., a kernel density function, e.g., an Epanechnikov kernel density function) often comprises a bandwidth variable (e.g., a bandwidth). A bandwidth variable often defines the size and/or length of a window from which a probability density estimate (PDE) is derived when using a PDF. A window from which a PDE is derived often comprises a defined length of polynucleotides. In some embodiments a window from which a PDE is derived is a portion. A portion (e.g., a portion size, a portion length) is often determined according to a bandwidth variable. A bandwidth variable determines the length or size of the window used to determine a local genome bias estimate, that is, the length of a polynucleotide segment (e.g., a contiguous segment of nucleotide bases) from which a local genome bias estimate is determined. A PDE (e.g., read density, local genome bias estimate (e.g., a GC density)) can be determined using any suitable bandwidth, non-limiting examples of which include a bandwidth of about 5 bases to about 100,000 bases, about 5 bases to about 50,000 bases, about 5 bases to about 25,000 bases, about 5 bases to about 10,000 bases, about 5 bases to about 5,000 bases, about 5 bases to about 2,500 bases, about 5 bases to about 1000 bases, about 5 bases to about 500 bases, about 5 bases to about 250 bases, about 20 bases to about 250 bases, or the like. In some embodiments a local genome bias estimate (e.g., a GC density) is determined using a bandwidth of about 400 bases or less, about 350 bases or less, about 300 bases or less, about 250 bases or less, about 225 bases or less, about 200 bases or less, about 175 bases or less, about 150 bases or less, about 125 bases or less, about 100 bases or less, about 75 bases or less, about 50 bases or less or about 25 bases or less. 
In certain embodiments a local genomebias estimate (e.g., a GC density) is determined using a bandwidthdetermined according to an average, mean, median, or maximum read lengthof sequence reads obtained for a given subject and/or sample. Sometimesa local genome bias estimate (e.g., a GC density) is determined using abandwidth about equal to an average, mean, median, or maximum readlength of sequence reads obtained for a given subject and/or sample. Insome embodiments a local genome bias estimate (e.g., a GC density) isdetermined using a bandwidth of about 250, 240, 230, 220, 210, 200, 190,180, 160, 150, 140, 130, 120, 110, 100, 90, 80, 70, 60, 50, 40, 30, 20or about 10 bases.

A local genome bias estimate can be determined at a single baseresolution, although local genome bias estimates (e.g., local GCcontent) can be determined at a lower resolution. In some embodiments alocal genome bias estimate is determined for a local bias content. Alocal genome bias estimate (e.g., as determined using a PDF) often isdetermined using a window. In some embodiments, a local genome biasestimate comprises use of a window comprising a pre-selected number ofbases. Sometimes a window comprises a segment of contiguous bases.Sometimes a window comprises one or more portions of non-contiguousbases. Sometimes a window comprises one or more portions (e.g., portionsof a genome). A window size or length is often determined by a bandwidthand according to a PDF. In some embodiments a window is about 10 ormore, 8 or more, 7 or more, 6 or more, 5 or more, 4 or more, 3 or more,or about 2 or more times the length of a bandwidth. A window issometimes twice the length of a selected bandwidth when a PDF (e.g., akernel density function) is used to determine a density estimate. Awindow may comprise any suitable number of bases. In some embodiments awindow comprises about 5 bases to about 100,000 bases, about 5 bases toabout 50,000 bases, about 5 bases to about 25,000 bases, about 5 basesto about 10,000 bases, about 5 bases to about 5,000 bases, about 5 basesto about 2,500 bases, about 5 bases to about 1000 bases, about 5 basesto about 500 bases, about 5 bases to about 250 bases, or about 20 basesto about 250 bases. In some embodiments a genome, or segments thereof,is partitioned into a plurality of windows. Windows encompassing regionsof a genome may or may not overlap. In some embodiments windows arepositioned at equal distances from each other. In some embodimentswindows are positioned at different distances from each other. 
In certain embodiments a genome, or segment thereof, is partitioned into a plurality of sliding windows, where a window is slid incrementally across a genome, or segment thereof, where each window at each increment comprises a local genome bias estimate (e.g., a local GC density). A window can be slid across a genome at any suitable increment, according to any numerical pattern or according to any mathematically defined sequence. In some embodiments, for a local genome bias estimate determination, a window is slid across a genome, or a segment thereof, at an increment of about 10,000 bp or more, about 5,000 bp or more, about 2,500 bp or more, about 1,000 bp or more, about 750 bp or more, about 500 bp or more, about 400 bp or more, about 250 bp or more, about 100 bp or more, about 50 bp or more, or about 25 bp or more. In some embodiments, for a local genome bias estimate determination, a window is slid across a genome, or a segment thereof, at an increment of about 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, or about 1 bp. For example, for a local genome bias estimate determination, a window may comprise about 400 bp (e.g., a bandwidth of 200 bp) and may be slid across a genome in increments of 1 bp. In some embodiments, a local genome bias estimate is determined for each base in a genome, or segment thereof, using a kernel density function and a bandwidth of about 200 bp.
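
A sliding-window GC density with an Epanechnikov kernel can be sketched as shown below. This is a minimal illustration under stated assumptions: the estimate is taken to be the kernel-weighted fraction of G and C bases in a window of twice the bandwidth, normalized by the sum of kernel weights. The function names and the normalization are hypothetical choices, not a definition from the text, and bandwidth is assumed to be at least 1.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| < 1, otherwise 0."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def gc_density(seq, center, bandwidth):
    """Kernel-weighted GC fraction around a 0-based position.

    The window spans center - bandwidth to center + bandwidth, so its
    length is about twice the bandwidth (truncated at sequence ends).
    """
    lo = max(0, center - bandwidth)
    hi = min(len(seq), center + bandwidth + 1)
    num = 0.0  # summed weights of G/C bases
    den = 0.0  # summed weights of all bases in the window
    for pos in range(lo, hi):
        w = epanechnikov((pos - center) / bandwidth)
        den += w
        if seq[pos] in "GCgc":
            num += w
    return num / den if den > 0 else 0.0

def gc_density_track(seq, bandwidth, step=1):
    """Slide the window across seq at the given increment (step=1 gives one value per base)."""
    return [gc_density(seq, c, bandwidth) for c in range(0, len(seq), step)]
```

With bandwidth=200 and step=1 this corresponds to the example of a 400 bp window slid in 1 bp increments; a GC density for a mapped read could then be looked up at, for instance, the median mapped position of the read.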

In some embodiments a local genome bias estimate is a local GC contentand/or a representation of local GC content. The term “local” as usedherein (e.g., as used to describe a local bias, local bias estimate,local bias content, local genome bias, local GC content, and the like)refers to a polynucleotide segment of 10,000 bp or less. In someembodiments the term “local” refers to a polynucleotide segment of 5000bp or less, 4000 bp or less, 3000 bp or less, 2000 bp or less, 1000 bpor less, 500 bp or less, 250 bp or less, 200 bp or less, 175 bp or less,150 bp or less, 100 bp or less, 75 bp or less, or 50 bp or less. A localGC content is often a representation (e.g., a mathematical, aquantitative representation) of GC content for a local segment of agenome, sequence read, sequence read assembly (e.g., a contig, aprofile, and the like). For example, a local GC content can be a localGC bias estimate or a GC density.

One or more GC densities are often determined for polynucleotides of a reference or sample (e.g., a test sample). In some embodiments a GC density is a representation (e.g., a mathematical, a quantitative representation) of local GC content (e.g., for a polynucleotide segment of 5000 bp or less). In some embodiments a GC density is a local genome bias estimate. A GC density can be determined using a suitable process described herein and/or known in the art. A GC density can be determined using a suitable PDF (e.g., a kernel density function (e.g., an Epanechnikov kernel density function)). In some embodiments a GC density is a PDE (e.g., a kernel density estimation). In certain embodiments, a GC density is defined by the presence or absence of one or more guanine (G) and/or cytosine (C) nucleotides. Inversely, in some embodiments, a GC density can be defined by the presence or absence of one or more adenine (A) and/or thymidine (T) nucleotides. GC densities for local GC content, in some embodiments, are normalized according to GC densities determined for an entire genome, or segment thereof (e.g., autosomes, set of chromosomes, single chromosome, a gene). One or more GC densities can be determined for polynucleotides of a sample (e.g., a test sample) or a reference sample. A GC density often is determined for a reference genome. In some embodiments a GC density is determined for a sequence read according to a reference genome. A GC density of a read is often determined according to a GC density determined for a corresponding location and/or position of a reference genome to which a read is mapped. In some embodiments a GC density determined for a location on a reference genome is assigned and/or provided for a read, where the read, or a segment thereof, maps to the same location on the reference genome. Any suitable method can be used to determine a location of a mapped read on a reference genome for the purpose of generating a GC density for a read. 
In some embodiments a median position of a mapped read determinesa location on a reference genome from which a GC density for the read isdetermined. For example, where the median position of a read maps toChromosome 12 at base number x of a reference genome, the GC density ofthe read is often provided as the GC density determined by a kerneldensity estimation for a position located on Chromosome 12 at or nearbase number x of the reference genome. In some embodiments a GC densityis determined for some or all base positions of a read according to areference genome. Sometimes a GC density of a read comprises an average,sum, median or integral of two or more GC densities determined for aplurality of base positions on a reference genome.

In some embodiments a local genome bias estimation (e.g., a GC density) is quantitated and/or is provided a value. A local genome bias estimation (e.g., a GC density) is sometimes expressed as an average, mean, and/or median. A local genome bias estimation (e.g., a GC density) is sometimes expressed as a maximum peak height of a PDE. Sometimes a local genome bias estimation (e.g., a GC density) is expressed as a sum or an integral (e.g., an area under a curve (AUC)) of a suitable PDE. In some embodiments a GC density comprises a kernel weight. In certain embodiments a GC density of a read comprises a value about equal to an average, mean, sum, median, maximum peak height or integral of a kernel weight.

Bias Frequencies

Bias frequencies are sometimes determined according to one or more local genome bias estimates (e.g., GC densities). A bias frequency is sometimes a count or sum of the number of occurrences of a local genome bias estimate (e.g., each local genome bias estimate) for a sample, reference (e.g., a reference genome, a reference sequence) or part thereof. In some embodiments a bias frequency is a GC density frequency. A GC density frequency is often determined according to one or more GC densities. For example, a GC density frequency may represent the number of times a GC density of value x is represented over an entire genome, or a segment thereof. A bias frequency is often a distribution of local genome bias estimates, where the number of occurrences of each local genome bias estimate is represented as a bias frequency. Bias frequencies are sometimes mathematically manipulated and/or normalized. Bias frequencies can be mathematically manipulated and/or normalized by a suitable method. In some embodiments, bias frequencies are normalized according to a representation (e.g., a fraction, a percentage) of each local genome bias estimate for a sample, reference or part thereof (e.g., autosomes, a subset of chromosomes, a single chromosome, or reads thereof). Bias frequencies can be determined for some or all local genome bias estimates of a sample or reference. In some embodiments bias frequencies can be determined for local genome bias estimates for some or all sequence reads of a test sample.
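
Counting GC density frequencies and normalizing them to fractions can be sketched as follows. The binning of GC densities into a fixed number of equal-width bins over [0, 1] is an assumption made for illustration, as is the function name; the text does not prescribe a binning scheme.

```python
from collections import Counter

def gc_density_frequencies(densities, n_bins=20):
    """Return the fraction of GC densities falling in each bin.

    densities: GC density values in [0, 1], e.g., one per read or per base.
    A density of exactly 1.0 is placed in the last bin.
    """
    counts = Counter(min(int(d * n_bins), n_bins - 1) for d in densities)
    total = sum(counts.values())
    # Normalize occurrence counts to fractions so that samples and
    # references of different sizes can be compared.
    return {b: counts[b] / total for b in sorted(counts)}
```

Applying the same function to the GC densities of a reference and to the GC densities of the reads of a test sample yields two frequency distributions on a common set of bins.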

In some embodiments a system comprises a bias density module 6. A bias density module can accept, retrieve and/or store mapped sequence reads 5 and reference sequences 2 in any suitable format and generate local genome bias estimates, local genome bias distributions, bias frequencies, GC densities, GC density distributions and/or GC density frequencies (collectively represented by box 7). In some embodiments a bias density module transfers data and/or information (e.g., 7) to another suitable module (e.g., a relationship module 8).

Relationships

In some embodiments one or more relationships are generated between local genome bias estimates and bias frequencies. The term “relationship” as used herein refers to a mathematical and/or a graphical relationship between two or more variables or values. A relationship can be generated by a suitable mathematical and/or graphical process. Non-limiting examples of a relationship include a mathematical and/or graphical representation of a function, a correlation, a distribution, a linear or non-linear equation, a line, a regression, a fitted regression, the like or a combination thereof. Sometimes a relationship comprises a fitted relationship. In some embodiments a fitted relationship comprises a fitted regression. Sometimes a relationship comprises two or more variables or values that are weighted. In some embodiments a relationship comprises a fitted regression where one or more variables or values of the relationship are weighted. Sometimes a regression is fitted in a weighted fashion. Sometimes a regression is fitted without weighting. In certain embodiments, generating a relationship comprises plotting or graphing.

In some embodiments a suitable relationship is determined between localgenome bias estimates and bias frequencies. In some embodimentsgenerating a relationship between (i) local genome bias estimates and(ii) bias frequencies for a sample provides a sample bias relationship.In some embodiments generating a relationship between (i) local genomebias estimates and (ii) bias frequencies for a reference provides areference bias relationship. In certain embodiments, a relationship isgenerated between GC densities and GC density frequencies. In someembodiments generating a relationship between (i) GC densities and (ii)GC density frequencies for a sample provides a sample GC densityrelationship. In some embodiments generating a relationship between (i)GC densities and (ii) GC density frequencies for a reference provides areference GC density relationship. In some embodiments, where localgenome bias estimates are GC densities, a sample bias relationship is asample GC density relationship and a reference bias relationship is areference GC density relationship. GC densities of a reference GCdensity relationship and/or a sample GC density relationship are oftenrepresentations (e.g., mathematical or quantitative representation) oflocal GC content. In some embodiments a relationship between localgenome bias estimates and bias frequencies comprises a distribution. Insome embodiments a relationship between local genome bias estimates andbias frequencies comprises a fitted relationship (e.g., a fittedregression). In some embodiments a relationship between local genomebias estimates and bias frequencies comprises a fitted linear ornon-linear regression (e.g., a polynomial regression). In certainembodiments a relationship between local genome bias estimates and biasfrequencies comprises a weighted relationship where local genome biasestimates and/or bias frequencies are weighted by a suitable process. 
In some embodiments a weighted fitted relationship (e.g., a weighted fitting) can be obtained by a process comprising a quantile regression, parameterized distributions or an empirical distribution with interpolation. In certain embodiments a relationship between local genome bias estimates and bias frequencies for a test sample, a reference or part thereof, comprises a polynomial regression where local genome bias estimates are weighted. In some embodiments a weighted fitted model comprises weighting values of a distribution. Values of a distribution can be weighted by a suitable process. In some embodiments, values located near tails of a distribution are provided less weight than values closer to the median of the distribution.

For example, for a distribution between local genome bias estimates (e.g., GC densities) and bias frequencies (e.g., GC density frequencies), a weight is determined according to the bias frequency for a given local genome bias estimate, where local genome bias estimates comprising bias frequencies closer to the mean of a distribution are provided greater weight than local genome bias estimates comprising bias frequencies further from the mean.

In some embodiments a system comprises a relationship module 8. Arelationship module can generate relationships as well as functions,coefficients, constants and variables that define a relationship. Arelationship module can accept, store and/or retrieve data and/orinformation (e.g., 7) from a suitable module (e.g., a bias densitymodule 6) and generate a relationship. A relationship module oftengenerates and compares distributions of local genome bias estimates. Arelationship module can compare data sets and sometimes generateregressions and/or fitted relationships. In some embodiments arelationship module compares one or more distributions (e.g.,distributions of local genome bias estimates of samples and/orreferences) and provides weighting factors and/or weighting assignments9 for counts of sequence reads to another suitable module (e.g., a biascorrection module). Sometimes a relationship module provides normalizedcounts of sequence reads directly to a distribution module 21 where thecounts are normalized according to a relationship and/or a comparison.

Generating a Comparison and Use Thereof

In some embodiments a process for reducing local bias in sequence readscomprises normalizing counts of sequence reads. Counts of sequence readsare often normalized according to a comparison of a test sample to areference. For example, sometimes counts of sequence reads arenormalized by comparing local genome bias estimates of sequence reads ofa test sample to local genome bias estimates of a reference (e.g., areference genome, or part thereof). In some embodiments counts ofsequence reads are normalized by comparing bias frequencies of localgenome bias estimates of a test sample to bias frequencies of localgenome bias estimates of a reference. In some embodiments counts ofsequence reads are normalized by comparing a sample bias relationshipand a reference bias relationship, thereby generating a comparison.

Counts of sequence reads are often normalized according to a comparison of two or more relationships. In certain embodiments two or more relationships are compared thereby providing a comparison that is used for reducing local bias in sequence reads (e.g., normalizing counts). Two or more relationships can be compared by a suitable method. In some embodiments a comparison comprises adding, subtracting, multiplying and/or dividing a first relationship from a second relationship. In certain embodiments comparing two or more relationships comprises use of a suitable linear regression and/or a non-linear regression. In certain embodiments comparing two or more relationships comprises a suitable polynomial regression (e.g., a 3rd order polynomial regression). In some embodiments a comparison comprises adding, subtracting, multiplying and/or dividing a first regression from a second regression. In some embodiments two or more relationships are compared by a process comprising an inferential framework of multiple regressions. In some embodiments two or more relationships are compared by a process comprising a suitable multivariate analysis. In some embodiments two or more relationships are compared by a process comprising a basis function (e.g., a blending function, e.g., polynomial bases, Fourier bases, or the like), splines, a radial basis function and/or wavelets.

In certain embodiments a distribution of local genome bias estimatescomprising bias frequencies for a test sample and a reference iscompared by a process comprising a polynomial regression where localgenome bias estimates are weighted. In some embodiments a polynomialregression is generated between (i) ratios, each of which ratioscomprises bias frequencies of local genome bias estimates of a referenceand bias frequencies of local genome bias estimates of a sample and (ii)local genome bias estimates. In some embodiments a polynomial regressionis generated between (i) a ratio of bias frequencies of local genomebias estimates of a reference to bias frequencies of local genome biasestimates of a sample and (ii) local genome bias estimates. In someembodiments a comparison of a distribution of local genome biasestimates for reads of a test sample and a reference comprisesdetermining a log ratio (e.g., a log 2 ratio) of bias frequencies oflocal genome bias estimates for the reference and the sample. In someembodiments a comparison of a distribution of local genome biasestimates comprises dividing a log ratio (e.g., a log 2 ratio) of biasfrequencies of local genome bias estimates for the reference by a logratio (e.g., a log 2 ratio) of bias frequencies of local genome biasestimates for the sample.
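
One way to picture a weighted polynomial regression of a log 2 ratio of bias frequencies against local genome bias estimates is the sketch below. It is illustrative only: the weights are assumed to be the sample bias frequencies, the regression is solved through normal equations with Gaussian elimination, and the function names are hypothetical. At least degree + 1 distinct GC density values with non-zero frequencies are assumed.

```python
import math

def weighted_polyfit(x, y, w, degree=3):
    """Weighted least squares polynomial fit; returns coefficients c[0..degree]."""
    m = degree + 1
    # Normal equations A c = b for the weighted fit.
    A = [[sum(wi * xi ** (r + c) for xi, wi in zip(x, w)) for c in range(m)]
         for r in range(m)]
    b = [sum(wi * yi * xi ** r for xi, yi, wi in zip(x, y, w)) for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coef = [0.0] * m
    for r in reversed(range(m)):
        s = sum(A[r][c] * coef[c] for c in range(r + 1, m))
        coef[r] = (b[r] - s) / A[r][r]
    return coef

def log2_ratio_fit(ref_freq, sample_freq, degree=3):
    """Fit log2(reference frequency / sample frequency) against GC density.

    ref_freq, sample_freq: dicts mapping a GC density value to its frequency.
    Weighting by the sample frequency is an assumption for this sketch.
    """
    x = sorted(g for g in ref_freq
               if ref_freq[g] > 0 and sample_freq.get(g, 0) > 0)
    y = [math.log2(ref_freq[g] / sample_freq[g]) for g in x]
    w = [sample_freq[g] for g in x]
    return weighted_polyfit(x, y, w, degree)
```

Evaluating the fitted polynomial at the GC density of a read and raising 2 to that value gives a reference-to-sample ratio that could serve as a weighting factor for that read.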

Normalizing counts according to a comparison typically adjusts somecounts and not others. Normalizing counts sometimes adjusts all countsand sometimes does not adjust any counts of sequence reads. A count fora sequence read sometimes is normalized by a process that comprisesdetermining a weighting factor and sometimes the process does notinclude directly generating and utilizing a weighting factor.Normalizing counts according to a comparison sometimes comprisesdetermining a weighting factor for each count of a sequence read. Aweighting factor is often specific to a sequence read and is applied toa count of a specific sequence read. A weighting factor is oftendetermined according to a comparison of two or more bias relationships(e.g., a sample bias relationship compared to a reference biasrelationship). A normalized count is often determined by adjusting acount value according to a weighting factor. Adjusting a count accordingto a weighting factor sometimes includes adding, subtracting,multiplying and/or dividing a count for a sequence read by a weightingfactor. A weighting factor and/or a normalized count sometimes aredetermined from a regression (e.g., a regression line). A normalizedcount is sometimes obtained directly from a regression line (e.g., afitted regression line) resulting from a comparison between biasfrequencies of local genome bias estimates of a reference (e.g., areference genome) and a test sample. In some embodiments each count of aread of a sample is provided a normalized count value according to acomparison of (i) bias frequencies of a local genome bias estimates ofreads compared to (ii) bias frequencies of a local genome bias estimatesof a reference. In certain embodiments, counts of sequence readsobtained for a sample are normalized and bias in the sequence reads isreduced.
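
Determining a weighting factor for each read and applying it to the read count can be sketched as follows. The sketch assumes that each read starts with a raw count of 1, that GC densities have been assigned to bins, and that the weighting factor is the ratio of the reference bias frequency to the sample bias frequency for the bin of the read. These choices and the function names are assumptions for illustration.

```python
def weighting_factors(ref_freq, sample_freq):
    """Per-bin weighting factor: reference frequency / sample frequency."""
    return {b: ref_freq[b] / sample_freq[b]
            for b in sample_freq
            if sample_freq[b] > 0 and b in ref_freq}

def normalize_read_counts(read_bins, factors):
    """Multiply the raw count (1) of each read by the factor for its GC density bin.

    Reads whose bin has no factor keep their raw count of 1.
    """
    return [1.0 * factors.get(b, 1.0) for b in read_bins]
```

For instance, if a reference has frequencies {0: 0.5, 1: 0.5} and a sample has {0: 0.25, 1: 0.75}, reads in bin 0 are up-weighted by 2 and reads in bin 1 are down-weighted by about 0.67, so that the weighted counts per bin match the reference proportions.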

Sometimes a system comprises a bias correction module 10. In some embodiments, functions of a bias correction module are performed by a relationship module 8. A bias correction module can accept, retrieve, and/or store mapped sequence reads and weighting factors (e.g., 9) from a suitable module (e.g., a relationship module 8, a compression module 4). In some embodiments a bias correction module provides a count to mapped reads. In some embodiments a bias correction module applies weighting assignments and/or bias correction factors to counts of sequence reads thereby providing normalized and/or adjusted counts. A bias correction module often provides normalized counts to another suitable module (e.g., a distribution module 21).

In certain embodiments normalizing counts comprises factoring one ormore features in addition to GC density, and normalizing counts of thesequence reads. In certain embodiments normalizing counts comprisesfactoring one or more different local genome bias estimates, andnormalizing counts of the sequence reads. In certain embodiments countsof sequence reads are weighted according to a weighting determinedaccording to one or more features (e.g., one or more biases). In someembodiments counts are normalized according to one or more combinedweights. Sometimes factoring one or more features and/or normalizingcounts according to one or more combined weights are by a processcomprising use of a multivariate model. Any suitable multivariate modelcan be used to normalize counts. Non-limiting examples of a multivariatemodel include a multivariate linear regression, multivariate quantileregression, a multivariate interpolation of empirical data, a non-linearmultivariate model, the like, or a combination thereof.

In some embodiments a system comprises a multivariate correction module 13. A multivariate correction module can perform functions of a bias density module 6, relationship module 8 and/or a bias correction module 10 multiple times thereby adjusting counts for multiple biases. In some embodiments a multivariate correction module comprises one or more bias density modules 6, relationship modules 8 and/or bias correction modules 10. Sometimes a multivariate correction module provides normalized counts 11 to another suitable module (e.g., a distribution module 21).

Weighted Portions

In some embodiments portions are weighted. In some embodiments one or more portions are weighted thereby providing weighted portions. Weighting portions sometimes removes portion dependencies. Portions can be weighted by a suitable process. In some embodiments one or more portions are weighted by an eigen function (e.g., an eigenfunction). In some embodiments an eigen function comprises replacing portions with orthogonal eigen-portions. In some embodiments a system comprises a portion weighting module 42. In some embodiments a weighting module accepts, retrieves and/or stores read densities, read density profiles, and/or adjusted read density profiles. In some embodiments weighted portions are provided by a portion weighting module. In some embodiments, a weighting module is required to weight portions. A weighting module can weight portions by one or more weighting methods known in the art or described herein. A weighting module often provides weighted portions to another suitable module (e.g., a scoring module 46, a PCA statistics module 33, a profile generation module 26 and the like).

Principal Component Analysis

In some embodiments a read density profile (e.g., a read density profile of a test sample) is adjusted according to a principal component analysis (PCA). A read density profile of one or more reference samples and/or a read density profile of a test subject can be adjusted according to a PCA. Removing bias from a read density profile by a PCA related process is sometimes referred to herein as adjusting a profile. A PCA can be performed by a suitable PCA method, or a variation thereof. Non-limiting examples of a PCA method include a canonical correlation analysis (CCA), a Karhunen-Loève transform (KLT), a Hotelling transform, a proper orthogonal decomposition (POD), a singular value decomposition (SVD) of X, an eigenvalue decomposition (EVD) of X^(T)X, a factor analysis, an Eckart-Young theorem, a Schmidt-Mirsky theorem, empirical orthogonal functions (EOF), an empirical eigenfunction decomposition, an empirical component analysis, quasiharmonic modes, a spectral decomposition, an empirical modal analysis, the like, variations or combinations thereof. A PCA often identifies one or more biases in a read density profile. A bias identified by a PCA is sometimes referred to herein as a principal component. In some embodiments one or more biases can be removed by adjusting a read density profile according to one or more principal components using a suitable method. A read density profile can be adjusted by adding, subtracting, multiplying and/or dividing one or more principal components from a read density profile. In some embodiments one or more biases can be removed from a read density profile by subtracting one or more principal components from a read density profile. Although bias in a read density profile is often identified and/or quantitated by a PCA of a profile, principal components are often subtracted from a profile at the level of read densities. A PCA often identifies one or more principal components.
In some embodiments a PCA identifies a 1^(st), 2^(nd), 3^(rd), 4^(th), 5^(th), 6^(th), 7^(th), 8^(th), 9^(th), and a 10^(th) or more principal components. In certain embodiments 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more principal components are used to adjust a profile. Often, principal components are used to adjust a profile in the order of their appearance in a PCA. For example, where three principal components are subtracted from a read density profile, a 1^(st), 2^(nd) and 3^(rd) principal component are used. Sometimes a bias identified by a principal component comprises a feature of a profile that is not used to adjust a profile. For example, a PCA may identify a genetic variation (e.g., an aneuploidy, microduplication, microdeletion, deletion, translocation, insertion) and/or a gender difference as a principal component. Thus, in some embodiments, one or more principal components are not used to adjust a profile. For example, sometimes a 1^(st), 2^(nd) and 4^(th) principal component are used to adjust a profile where a 3^(rd) principal component is not used to adjust a profile. A principal component can be obtained from a PCA using any suitable sample or reference. In some embodiments principal components are obtained from a test sample (e.g., a test subject). In some embodiments principal components are obtained from one or more references (e.g., reference samples, reference sequences, a reference set). In certain instances, a PCA is performed on a median read density profile obtained from a training set comprising multiple samples resulting in the identification of a 1^(st) principal component and a 2^(nd) principal component. In some embodiments principal components are obtained from a set of subjects known to be devoid of a genetic variation in question. In some embodiments principal components are obtained from a set of known euploids. Principal components are often identified according to a PCA performed using one or more read density profiles of a reference (e.g., a training set).
One or more principal components obtained from a reference are often subtracted from a read density profile of a test subject thereby providing an adjusted profile.
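
The subtraction described above can be sketched in Python as follows. This is an illustrative sketch only: it assumes the principal components were already obtained from a reference as unit-length, mutually orthogonal vectors over the same portions as the test profile, and that the test profile was already centered (e.g., by subtracting portion medians). The function name `subtract_components`, its `use` argument and the toy vectors are hypothetical.

```python
def subtract_components(profile, components, use=None):
    """Remove the projection of a profile onto selected principal components.

    profile    -- read densities, one value per portion (assumed centered)
    components -- principal components, each a unit-length vector over portions
    use        -- indices of the components to subtract (default: all of them)
    """
    if use is None:
        use = range(len(components))
    adjusted = list(profile)
    for k in use:
        pc = components[k]
        # Score of the profile on this component (dot product).
        score = sum(p * c for p, c in zip(profile, pc))
        adjusted = [a - score * c for a, c in zip(adjusted, pc)]
    return adjusted


# Three hypothetical orthonormal components over four portions.
components = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
]
test_profile = [3.0, 1.0, 2.0, 0.0]

# Subtract the 1st and 2nd components.
adjusted_12 = subtract_components(test_profile, components, use=[0, 1])

# Subtract the 1st and 3rd components, leaving the 2nd untouched, as might be
# done when a component reflects a feature that should not be removed.
adjusted_13 = subtract_components(test_profile, components, use=[0, 2])
```

Passing an explicit list of indices mirrors the case in which some principal components are deliberately not used to adjust a profile.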

In some embodiments a system comprises a PCA statistics module 33. A PCA statistics module can accept and/or retrieve read density profiles from another suitable module (e.g., a profile generation module 26). A PCA is often performed by a PCA statistics module. A PCA statistics module often accepts, retrieves and/or stores read density profiles and processes read density profiles from a reference set 32, training set 30 and/or from one or more test subjects 28. A PCA statistics module can generate and/or provide principal components and/or adjust read density profiles according to one or more principal components. Adjusted read density profiles (e.g., 40, 38) are often provided by a PCA statistics module. A PCA statistics module can provide and/or transfer adjusted read density profiles (e.g., 38, 40) to another suitable module (e.g., a portion weighting module 42, a scoring module 46). In some embodiments a PCA statistics module can provide a gender call 36. A gender call is sometimes a determination of fetal gender determined according to a PCA and/or according to one or more principal components. In some embodiments a PCA statistics module comprises some, all or a modification of the R code shown below. An R code for computing principal components generally starts with cleaning the data (e.g., subtracting median, filtering portions, and trimming extreme values):

  # Clean the data outliers for PCA
  dclean <- (dat - m)[mask, ]
  for (j in 1:ncol(dclean)) {
    q <- quantile(dclean[, j], c(.25, .75))
    qmin <- q[1] - 4 * (q[2] - q[1])
    qmax <- q[2] + 4 * (q[2] - q[1])
    dclean[dclean[, j] < qmin, j] <- qmin
    dclean[dclean[, j] > qmax, j] <- qmax
  }

Then the principal components are computed:

  # Compute principal components
  pc <- prcomp(dclean)$x

Finally, each sample's PCA-adjusted profile can be computed with:

  # Compute residuals
  mm <- model.matrix(~pc[, 1:numpc])
  for (j in 1:ncol(dclean))
    dclean[, j] <- dclean[, j] - predict(lm(dclean[, j] ~ mm))
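
For readers less familiar with R, the trimming of extreme values in the cleaning step above can be sketched in Python as follows. This sketch covers a single sample column only. The helper names `quantile` and `clip_outliers` are hypothetical, and the quantile helper uses linear interpolation between order statistics, which is the default behavior of R's `quantile` function.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between order statistics."""
    h = (len(sorted_values) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (h - lo) * (sorted_values[hi] - sorted_values[lo])


def clip_outliers(values, k=4.0):
    """Clamp values to [Q1 - k*IQR, Q3 + k*IQR], with k = 4 as in the R code."""
    s = sorted(values)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    qmin = q1 - k * (q3 - q1)
    qmax = q3 + k * (q3 - q1)
    return [min(max(v, qmin), qmax) for v in values]


# Hypothetical median-subtracted read densities for one sample.
column = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
cleaned = clip_outliers(column)
```

Here the quartiles are 2.25 and 4.75, so the upper limit is 14.75 and only the last value is pulled in.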

Comparing Profiles

In some embodiments, determining an outcome comprises a comparison. In certain embodiments, a read density profile, or a portion thereof, is utilized to provide an outcome. In some embodiments determining an outcome (e.g., a determination of the presence or absence of a genetic variation) comprises a comparison of two or more read density profiles. Comparing read density profiles often comprises comparing read density profiles generated for a selected segment of a genome. For example, a test profile is often compared to a reference profile where the test and reference profiles were determined for a segment of a genome (e.g., a reference genome) that is substantially the same segment. Comparing read density profiles sometimes comprises comparing two or more subsets of portions of a read density profile. A subset of portions of a read density profile may represent a segment of a genome (e.g., a chromosome, or segment thereof). A read density profile can comprise any number of subsets of portions. Sometimes a read density profile comprises two or more, three or more, four or more, or five or more subsets. In certain embodiments a read density profile comprises two subsets of portions where each portion represents segments of a reference genome that are adjacent. In some embodiments a test profile can be compared to a reference profile where the test profile and reference profile both comprise a first subset of portions and a second subset of portions where the first and second subsets represent different segments of a genome. Some subsets of portions of a read density profile may comprise genetic variations and other subsets of portions are sometimes substantially free of genetic variations. Sometimes all subsets of portions of a profile (e.g., a test profile) are substantially free of a genetic variation. Sometimes all subsets of portions of a profile (e.g., a test profile) comprise a genetic variation.
In some embodiments a test profile can comprise a first subset of portions that comprise a genetic variation and a second subset of portions that are substantially free of a genetic variation.

In some embodiments methods described herein comprise performing a comparison (e.g., comparing a test profile to a reference profile). Two or more data sets, two or more relationships and/or two or more profiles can be compared by a suitable method. Non-limiting examples of statistical methods suitable for comparing data sets, relationships and/or profiles include Behrens-Fisher approach, bootstrapping, Fisher's method for combining independent tests of significance, Neyman-Pearson testing, confirmatory data analysis, exploratory data analysis, exact test, F-test, Z-test, T-test, calculating and/or comparing a measure of uncertainty, a null hypothesis, counternulls and the like, a chi-square test, omnibus test, calculating and/or comparing level of significance (e.g., statistical significance), a meta analysis, a multivariate analysis, a regression, simple linear regression, robust linear regression, the like or combinations of the foregoing. In certain embodiments comparing two or more data sets, relationships and/or profiles comprises determining and/or comparing a measure of uncertainty. A “measure of uncertainty” as used herein refers to a measure of significance (e.g., statistical significance), a measure of error, a measure of variance, a measure of confidence, the like or a combination thereof.
A measure of uncertainty can be a value (e.g., a threshold) or a range of values (e.g., an interval, a confidence interval, a Bayesian confidence interval, a threshold range). Non-limiting examples of a measure of uncertainty include p-values, a suitable measure of deviation (e.g., standard deviation, sigma, absolute deviation, mean absolute deviation, the like), a suitable measure of error (e.g., standard error, mean squared error, root mean squared error, the like), a suitable measure of variance, a suitable standard score (e.g., standard deviations, cumulative percentages, percentile equivalents, Z-scores, T-scores, R-scores, standard nine (stanine), percent in stanine, the like), the like or combinations thereof. In some embodiments determining the level of significance comprises determining a measure of uncertainty (e.g., a p-value). In certain embodiments, two or more data sets, relationships and/or profiles can be analyzed and/or compared by utilizing multiple (e.g., 2 or more) statistical methods (e.g., least squares regression, principal component analysis, linear discriminant analysis, quadratic discriminant analysis, bagging, neural networks, support vector machine models, random forests, classification tree models, K-nearest neighbors, logistic regression and/or loss smoothing) and/or any suitable mathematical and/or statistical manipulations (e.g., referred to herein as manipulations).

In certain embodiments comparing two or more read density profiles comprises determining and/or comparing a measure of uncertainty for two or more read density profiles. Read density profiles and/or associated measures of uncertainty are sometimes compared to facilitate interpretation of mathematical and/or statistical manipulations of a data set and/or to provide an outcome. A read density profile generated for a test subject sometimes is compared to a read density profile generated for one or more references (e.g., reference samples, reference subjects, and the like). In some embodiments an outcome is provided by comparing a read density profile from a test subject to a read density profile from a reference for a chromosome, portions or segments thereof, where a reference read density profile is obtained from a set of reference subjects known not to possess a genetic variation (e.g., a reference). In some embodiments an outcome is provided by comparing a read density profile from a test subject to a read density profile from a reference for a chromosome, portions or segments thereof, where a reference read density profile is obtained from a set of reference subjects known to possess a specific genetic variation (e.g., a chromosome aneuploidy, a trisomy, a microduplication, a microdeletion).

In certain embodiments, a read density profile of a test subject is compared to a predetermined value representative of the absence of a genetic variation, and sometimes deviates from a predetermined value at one or more genomic locations (e.g., portions) corresponding to a genomic location in which a genetic variation is located. For example, in test subjects (e.g., subjects at risk for, or suffering from a medical condition associated with a genetic variation), read density profiles are expected to differ significantly from read density profiles of a reference (e.g., a reference sequence, reference subject, reference set) for selected portions when a test subject comprises a genetic variation in question. Read density profiles of a test subject are often substantially the same as read density profiles of a reference (e.g., a reference sequence, reference subject, reference set) for selected portions when a test subject does not comprise a genetic variation in question. Read density profiles are often compared to a predetermined threshold and/or threshold range. The term “threshold” as used herein refers to any number that is calculated using a qualifying data set and serves as a limit of diagnosis of a genetic variation (e.g., a copy number variation, an aneuploidy, a chromosomal aberration, a microduplication, a microdeletion, and the like). In certain embodiments a threshold is exceeded by results obtained by methods described herein and a subject is diagnosed with a genetic variation (e.g., a trisomy). In some embodiments a threshold value or range of values often is calculated by mathematically and/or statistically manipulating sequence read data (e.g., from a reference and/or subject).
A predetermined threshold or threshold range of values indicative of the presence or absence of a genetic variation can vary while still providing an outcome useful for determining the presence or absence of a genetic variation. In certain embodiments, a read density profile comprising normalized read densities and/or normalized counts is generated to facilitate classification and/or providing an outcome. An outcome can be provided based on a plot of a read density profile comprising normalized counts (e.g., using a plot of such a read density profile).
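
As one hedged illustration of comparing a test profile to a reference with a threshold, the Python sketch below computes a Z-score for a single summary value of a test profile (for example, a median read density over selected portions) against the same summary computed for a set of reference subjects. The summary statistic, the threshold of 3 and the function names `z_score` and `exceeds_threshold` are assumptions for illustration; the text above leaves the choice of statistic and threshold open.

```python
from statistics import mean, stdev


def z_score(test_value, reference_values):
    """Standard score of a test value relative to a reference set."""
    return (test_value - mean(reference_values)) / stdev(reference_values)


def exceeds_threshold(z, threshold=3.0):
    """True when the absolute Z-score is beyond a hypothetical threshold."""
    return abs(z) > threshold


# Hypothetical summary read densities for selected portions in reference subjects.
reference_values = [0.98, 1.00, 1.02, 1.00, 1.00]

z_elevated = z_score(1.06, reference_values)   # test subject with elevated density
z_typical = z_score(1.01, reference_values)    # test subject close to the reference
```

With these toy numbers the first test value lies more than four reference standard deviations above the reference mean and the second lies within one.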

In some embodiments a system comprises a scoring module 46. A scoring module can accept, retrieve and/or store read density profiles (e.g., adjusted, normalized read density profiles) from another suitable module (e.g., a profile generation module 26, a PCA statistics module 33, a portion weighting module 42, and the like). A scoring module can accept, retrieve, store and/or compare two or more read density profiles (e.g., test profiles, reference profiles, training sets, test subjects). A scoring module can often provide a score (e.g., a plot, profile statistics, a comparison (e.g., a difference between two or more profiles), a Z-score, a measure of uncertainty, a call zone, a sample call 50 (e.g., a determination of the presence or absence of a genetic variation), and/or an outcome). A scoring module can provide a score to an end user and/or to another suitable module (e.g., a display, printer, the like). In some embodiments a scoring module comprises some, all or a modification of the R code shown below which comprises an R function for computing Chi-square statistics for a specific test (e.g., High-chr21 counts).

The three parameters are:

  x = sample read data (portion x sample)
  m = median values for portions
  y = test vector (Ex. False for all portions except True for chr21)

  getChisqP <- function(x, m, y) {
    # Portions outside the test region: counts above / not above the median
    ahigh <- apply(x[!y, ], 2, function(x) sum((x > m[!y])))
    alow <- sum((!y)) - ahigh
    # Portions inside the test region: counts above / not above the median
    bhigh <- apply(x[y, ], 2, function(x) sum((x > m[y])))
    blow <- sum(y) - bhigh
    p <- sapply(1:length(ahigh), function(i) {
      p <- chisq.test(matrix(c(ahigh[i], alow[i], bhigh[i], blow[i]), 2))$p.value / 2
      if (ahigh[i] / alow[i] > bhigh[i] / blow[i]) {
        p <- max(p, 1 - p)
      } else {
        p <- min(p, 1 - p)
      }
      p
    })
    return(p)
  }

Hybrid Regression Normalization

In some embodiments a hybrid normalization method is used. In some embodiments a hybrid normalization method reduces bias (e.g., GC bias). A hybrid normalization, in some embodiments, comprises (i) an analysis of a relationship of two variables (e.g., counts and GC content) and (ii) selection and application of a normalization method according to the analysis. A hybrid normalization, in certain embodiments, comprises (i) a regression (e.g., a regression analysis) and (ii) selection and application of a normalization method according to the regression. In some embodiments counts obtained for a first sample (e.g., a first set of samples) are normalized by a different method than counts obtained from another sample (e.g., a second set of samples). In some embodiments counts obtained for a first sample (e.g., a first set of samples) are normalized by a first normalization method and counts obtained from a second sample (e.g., a second set of samples) are normalized by a second normalization method. For example, in certain embodiments a first normalization method comprises use of a linear regression and a second normalization method comprises use of a non-linear regression (e.g., a LOESS, GC-LOESS, LOWESS regression, LOESS smoothing).

In some embodiments a hybrid normalization method is used to normalize sequence reads mapped to portions of a genome or chromosome (e.g., counts, mapped counts, mapped reads). In certain embodiments raw counts are normalized and in some embodiments adjusted, weighted, filtered or previously normalized counts are normalized by a hybrid normalization method. In certain embodiments, genomic section levels or Z-scores are normalized. In some embodiments counts mapped to selected portions of a genome or chromosome are normalized by a hybrid normalization approach. Counts can refer to a suitable measure of sequence reads mapped to portions of a genome, non-limiting examples of which include raw counts (e.g., unprocessed counts), normalized counts (e.g., normalized by ChAI or a suitable method), portion levels (e.g., average levels, mean levels, median levels, or the like), Z-scores, the like, or combinations thereof. The counts can be raw counts or processed counts from one or more samples (e.g., a test sample, a sample from a pregnant female). In some embodiments counts are obtained from one or more samples obtained from one or more subjects.

In some embodiments a normalization method (e.g., the type of normalization method) is selected according to a regression (e.g., a regression analysis) and/or a correlation coefficient. A regression analysis refers to a statistical technique for estimating a relationship among variables (e.g., counts and GC content). In some embodiments a regression is generated according to counts and a measure of GC content for each portion of multiple portions of a reference genome. A suitable measure of GC content can be used, non-limiting examples of which include a measure of guanine, cytosine, adenine, thymine, purine (GC), or pyrimidine (AT or ATU) content, melting temperature (T_(m)) (e.g., denaturation temperature, annealing temperature, hybridization temperature), a measure of free energy, the like or combinations thereof. A measure of guanine (G), cytosine (C), adenine (A), thymine (T), purine (GC), or pyrimidine (AT or ATU) content can be expressed as a ratio or a percentage. In some embodiments any suitable ratio or percentage is used, non-limiting examples of which include GC/AT, GC/total nucleotide, GC/A, GC/T, AT/total nucleotide, AT/GC, AT/G, AT/C, G/A, C/A, G/T, G/AT, C/T, the like or combinations thereof. In some embodiments a measure of GC content is a ratio or percentage of GC to total nucleotide content. In some embodiments a measure of GC content is a ratio or percentage of GC to total nucleotide content for sequence reads mapped to a portion of a reference genome. In certain embodiments the GC content is determined according to and/or from sequence reads mapped to each portion of a reference genome and the sequence reads are obtained from a sample (e.g., a sample obtained from a pregnant female). In some embodiments a measure of GC content is not determined according to and/or from sequence reads. In certain embodiments, a measure of GC content is determined for one or more samples obtained from one or more subjects.
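
A measure of GC content expressed as a ratio of GC to total nucleotide content can be computed as in the short Python sketch below. The function name `gc_fraction` is hypothetical, and the choice to ignore characters other than A, C, G and T (e.g., an ambiguous base N) is an assumption of this sketch.

```python
def gc_fraction(sequence):
    """Ratio of G and C bases to all A, C, G and T bases in a sequence.

    Returns None when the sequence contains no A, C, G or T characters.
    """
    seq = sequence.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return None
    return (seq.count("G") + seq.count("C")) / total


# Hypothetical sequences, e.g., reads mapped to a portion or the portion itself.
balanced = gc_fraction("GGCCAATT")
with_ambiguous_base = gc_fraction("ATGCN")
```

Multiplying the returned ratio by 100 gives the same measure as a percentage.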

In some embodiments generating a regression comprises generating a regression analysis or a correlation analysis. A suitable regression can be used, non-limiting examples of which include a regression analysis (e.g., a linear regression analysis), a goodness of fit analysis, a Pearson's correlation analysis, a rank correlation, a fraction of variance unexplained, Nash-Sutcliffe model efficiency analysis, regression model validation, proportional reduction in loss, root mean square deviation, the like or a combination thereof. In some embodiments a regression line is generated. In certain embodiments generating a regression comprises generating a linear regression. In certain embodiments generating a regression comprises generating a non-linear regression (e.g., a LOESS regression, a LOWESS regression).

In some embodiments a regression determines the presence or absence of a correlation (e.g., a linear correlation), for example between counts and a measure of GC content. In some embodiments a regression (e.g., a linear regression) is generated and a correlation coefficient is determined. In some embodiments a suitable correlation coefficient is determined, non-limiting examples of which include a coefficient of determination, an R² value, a Pearson's correlation coefficient, or the like.

In some embodiments goodness of fit is determined for a regression (e.g., a regression analysis, a linear regression). Goodness of fit sometimes is determined by visual or mathematical analysis. An assessment sometimes includes determining whether the goodness of fit is greater for a non-linear regression or for a linear regression. In some embodiments a correlation coefficient is a measure of a goodness of fit. In some embodiments an assessment of a goodness of fit for a regression is determined according to a correlation coefficient and/or a correlation coefficient cutoff value. In some embodiments an assessment of a goodness of fit comprises comparing a correlation coefficient to a correlation coefficient cutoff value. In some embodiments an assessment of a goodness of fit for a regression is indicative of a linear regression. For example, in certain embodiments, a goodness of fit is greater for a linear regression than for a non-linear regression and the assessment of the goodness of fit is indicative of a linear regression. In some embodiments an assessment is indicative of a linear regression and a linear regression is used to normalize the counts. In some embodiments an assessment of a goodness of fit for a regression is indicative of a non-linear regression. For example, in certain embodiments, a goodness of fit is greater for a non-linear regression than for a linear regression and the assessment of the goodness of fit is indicative of a non-linear regression. In some embodiments an assessment is indicative of a non-linear regression and a non-linear regression is used to normalize the counts.

In some embodiments an assessment of a goodness of fit is indicative of a linear regression when a correlation coefficient is equal to or greater than a correlation coefficient cutoff. In some embodiments an assessment of a goodness of fit is indicative of a non-linear regression when a correlation coefficient is less than a correlation coefficient cutoff. In some embodiments a correlation coefficient cutoff is pre-determined. In some embodiments a correlation coefficient cutoff is about 0.5 or greater, about 0.55 or greater, about 0.6 or greater, about 0.65 or greater, about 0.7 or greater, about 0.75 or greater, about 0.8 or greater or about 0.85 or greater.

For example, in certain embodiments, a normalization method comprising a linear regression is used when a correlation coefficient is equal to or greater than about 0.6. In certain embodiments, counts of a sample (e.g., counts per portion of a reference genome, counts per portion) are normalized according to a linear regression when a correlation coefficient is equal to or greater than a correlation coefficient cut-off of 0.6, otherwise the counts are normalized according to a non-linear regression (e.g., when the coefficient is less than a correlation coefficient cut-off of 0.6). In some embodiments a normalization process comprises generating a linear regression or non-linear regression for (i) the counts and (ii) the GC content, for each portion of multiple portions of a reference genome. In certain embodiments, a normalization method comprising a non-linear regression (e.g., a LOWESS, a LOESS) is used when a correlation coefficient is less than a correlation coefficient cut-off of 0.6. In some embodiments a normalization method comprising a non-linear regression (e.g., a LOWESS) is used when a correlation coefficient is less than a correlation coefficient cut-off of about 0.7, less than about 0.65, less than about 0.6, less than about 0.55 or less than about 0.5. For example, in some embodiments a normalization method comprising a non-linear regression (e.g., a LOWESS, a LOESS) is used when a correlation coefficient is less than a correlation coefficient cut-off of about 0.6.
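
The selection rule described above can be sketched in Python as follows. The sketch assumes the correlation coefficient is the coefficient of determination (R²) of a simple linear regression of counts on GC content, computed as the squared Pearson correlation, and uses a cut-off of 0.6. The function names and the example data are hypothetical, and the sketch only selects a method; it does not perform the regression itself.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a simple linear regression."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def choose_normalization(gc_content, counts, cutoff=0.6):
    """Return 'linear' at or above the cut-off, otherwise 'non-linear'."""
    if r_squared(gc_content, counts) >= cutoff:
        return "linear"
    return "non-linear"


# Hypothetical per-portion data: counts rise steadily with GC content.
linear_choice = choose_normalization([0.3, 0.4, 0.5, 0.6], [90, 100, 110, 120])

# Hypothetical per-portion data: counts peak at intermediate GC content.
curved_choice = choose_normalization(
    [0.3, 0.4, 0.5, 0.6, 0.7], [100, 120, 125, 120, 100]
)
```

The second data set has a clear relationship between counts and GC content, but not a linear one, so its R² for a straight line is near zero and a non-linear regression (e.g., a LOESS) would be selected.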

In some embodiments a specific type of regression is selected (e.g., a linear or non-linear regression) and, after the regression is generated, counts are normalized by subtracting the regression from the counts. In some embodiments subtracting a regression from the counts provides normalized counts with reduced bias (e.g., GC bias). In some embodiments a linear regression is subtracted from the counts. In some embodiments a non-linear regression (e.g., a LOESS, GC-LOESS, LOWESS regression) is subtracted from the counts. Any suitable method can be used to subtract a regression line from the counts. For example, if counts x are derived from portion i comprising a GC content of 0.5 and a regression line determines counts y at a GC content of 0.5, then x−y=normalized counts for portion i. In some embodiments counts are normalized prior to and/or after subtracting a regression. In some embodiments, counts normalized by a hybrid normalization approach are used to generate genomic section levels, Z-scores, levels and/or profiles of a genome or a segment thereof. In certain embodiments, counts normalized by a hybrid normalization approach are analyzed by methods described herein to determine the presence or absence of a genetic variation (e.g., in a fetus).
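
The x−y subtraction can be illustrated for the linear case with the Python sketch below, which fits an ordinary least squares line of counts on GC content and subtracts the fitted value from each portion's counts. The function names and the example data are hypothetical, and a non-linear regression such as a LOESS would replace the fitted line in the non-linear case.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def subtract_regression(gc_content, counts):
    """Normalized counts = observed counts minus counts predicted from GC."""
    slope, intercept = fit_line(gc_content, counts)
    return [c - (slope * g + intercept) for g, c in zip(gc_content, counts)]


# Hypothetical per-portion GC content and counts.
gc_content = [0.3, 0.4, 0.5, 0.6]
counts = [92, 98, 112, 118]

normalized = subtract_regression(gc_content, counts)
```

For the portion with a GC content of 0.5, the observed counts x are 112 and the line predicts y of about 109.6, so the normalized counts are about 2.4.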

In some embodiments a hybrid normalization method comprises filtering or weighting one or more portions before or after normalization. A suitable method of filtering portions, including methods of filtering portions (e.g., portions of a reference genome) described herein, can be used. In some embodiments, portions (e.g., portions of a reference genome) are filtered prior to applying a hybrid normalization method. In some embodiments, only counts of sequencing reads mapped to selected portions (e.g., portions selected according to count variability) are normalized by a hybrid normalization. In some embodiments counts of sequencing reads mapped to filtered portions of a reference genome (e.g., portions filtered according to count variability) are removed prior to utilizing a hybrid normalization method. In some embodiments a hybrid normalization method comprises selecting or filtering portions (e.g., portions of a reference genome) according to a suitable method (e.g., a method described herein). In some embodiments a hybrid normalization method comprises selecting or filtering portions (e.g., portions of a reference genome) according to an uncertainty value for counts mapped to each of the portions for multiple test samples. In some embodiments a hybrid normalization method comprises selecting or filtering portions (e.g., portions of a reference genome) according to count variability. In some embodiments a hybrid normalization method comprises selecting or filtering portions (e.g., portions of a reference genome) according to GC content, repetitive elements, repetitive sequences, introns, exons, the like or a combination thereof.

For example, in some embodiments multiple samples from multiple pregnant female subjects are analyzed and a subset of portions (e.g., portions of a reference genome) are selected according to count variability. In certain embodiments a linear regression is used to determine a correlation coefficient for (i) counts and (ii) GC content, for each of the selected portions for a sample obtained from a pregnant female subject. In some embodiments a correlation coefficient is determined that is greater than a pre-determined correlation cutoff value (e.g., of about 0.6), an assessment of the goodness of fit is indicative of a linear regression and the counts are normalized by subtracting the linear regression from the counts. In certain embodiments a correlation coefficient is determined that is less than a pre-determined correlation cutoff value (e.g., of about 0.6), an assessment of the goodness of fit is indicative of a non-linear regression, a LOESS regression is generated and the counts are normalized by subtracting the LOESS regression from the counts.

Profiles

In some embodiments, a processing step can comprise generating one or more profiles (e.g., profile plot) from various aspects of a data set or derivation thereof (e.g., product of one or more mathematical and/or statistical data processing steps known in the art and/or described herein). The term “profile” as used herein refers to a product of a mathematical and/or statistical manipulation of data that can facilitate identification of patterns and/or correlations in large quantities of data. A “profile” often includes values resulting from one or more manipulations of data or data sets, based on one or more criteria. A profile often includes multiple data points. Any suitable number of data points may be included in a profile depending on the nature and/or complexity of a data set. In certain embodiments, profiles may include 2 or more data points, 3 or more data points, 5 or more data points, 10 or more data points, 24 or more data points, 25 or more data points, 50 or more data points, 100 or more data points, 500 or more data points, 1000 or more data points, 5000 or more data points, 10,000 or more data points, or 100,000 or more data points.

In some embodiments, a profile is representative of the entirety of a data set, and in certain embodiments, a profile is representative of a part or subset of a data set. That is, a profile sometimes includes or is generated from data points representative of data that has not been filtered to remove any data, and sometimes a profile includes or is generated from data points representative of data that has been filtered to remove unwanted data. In some embodiments, a data point in a profile represents the results of data manipulation for a portion. In certain embodiments, a data point in a profile includes results of data manipulation for groups of portions. In some embodiments, groups of portions may be adjacent to one another, and in certain embodiments, groups of portions may be from different parts of a chromosome or genome.

Data points in a profile derived from a data set can be representative of any suitable data categorization. Non-limiting examples of categories into which data can be grouped to generate profile data points include: portions based on size, portions based on sequence features (e.g., GC content, AT content, position on a chromosome (e.g., short arm, long arm, centromere, telomere), and the like), levels of expression, chromosome, the like or combinations thereof. In some embodiments, a profile may be generated from data points obtained from another profile (e.g., normalized data profile renormalized to a different normalizing value to generate a renormalized data profile). In certain embodiments, a profile generated from data points obtained from another profile reduces the number of data points and/or complexity of the data set. Reducing the number of data points and/or complexity of a data set often facilitates interpretation of data and/or facilitates providing an outcome.

A profile (e.g., a genomic profile, a chromosome profile, a profile of a segment of a chromosome) often is a collection of normalized or non-normalized counts for two or more portions. A profile often includes at least one level (e.g., a genomic section level), and often comprises two or more levels (e.g., a profile often has multiple levels). A level generally is for a set of portions having about the same counts or normalized counts. Levels are described in greater detail herein. In certain embodiments, a profile comprises one or more portions, which portions can be weighted, removed, filtered, normalized, adjusted, averaged, derived as a mean, added, subtracted, processed or transformed by any combination thereof. A profile often comprises normalized counts mapped to portions defining two or more levels, where the counts are further normalized according to one of the levels by a suitable method. Often counts of a profile (e.g., a profile level) are associated with an uncertainty value.

A profile comprising one or more levels is sometimes padded (e.g., hole padding). Padding (e.g., hole padding) refers to a process of identifying and adjusting levels in a profile that are due to maternal microdeletions or maternal duplications (e.g., copy number variations). In some embodiments levels are padded that are due to fetal microduplications or fetal microdeletions. Microduplications or microdeletions in a profile can, in some embodiments, artificially raise or lower the overall level of a profile (e.g., a profile of a chromosome), leading to false positive or false negative determinations of a chromosome aneuploidy (e.g., a trisomy). In some embodiments levels in a profile that are due to microduplications and/or deletions are identified and adjusted (e.g., padded and/or removed) by a process sometimes referred to as padding or hole padding. In certain embodiments a profile comprises one or more first levels that are significantly different than a second level within the profile, each of the one or more first levels comprises a maternal copy number variation, fetal copy number variation, or a maternal copy number variation and a fetal copy number variation and one or more of the first levels are adjusted.

A profile comprising one or more levels can include a first level and a second level. In some embodiments a first level is different (e.g., significantly different) than a second level. In some embodiments a first level comprises a first set of portions, a second level comprises a second set of portions and the first set of portions is not a subset of the second set of portions. In certain embodiments, a first set of portions is different than a second set of portions from which a first and second level are determined. In some embodiments a profile can have multiple first levels that are different (e.g., significantly different, e.g., have a significantly different value) than a second level within the profile. In some embodiments a profile comprises one or more first levels that are significantly different than a second level within the profile and one or more of the first levels are adjusted. In some embodiments a profile comprises one or more first levels that are significantly different than a second level within the profile, each of the one or more first levels comprises a maternal copy number variation, fetal copy number variation, or a maternal copy number variation and a fetal copy number variation and one or more of the first levels are adjusted. In some embodiments a first level within a profile is removed from the profile or adjusted (e.g., padded). A profile can comprise multiple levels that include one or more first levels significantly different than one or more second levels and often the majority of levels in a profile are second levels, which second levels are about equal to one another. In some embodiments greater than 50%, greater than 60%, greater than 70%, greater than 80%, greater than 90% or greater than 95% of the levels in a profile are second levels.

A profile sometimes is displayed as a plot. For example, one or more levels representing counts (e.g., normalized counts) of portions can be plotted and visualized. Non-limiting examples of profile plots that can be generated include raw count (e.g., raw count profile or raw profile), normalized count, portion-weighted, z-score, p-value, area ratio versus fitted ploidy, median level versus ratio between fitted and measured fetal fraction, principal components, the like, or combinations thereof. Profile plots allow visualization of the manipulated data, in some embodiments. In certain embodiments, a profile plot can be utilized to provide an outcome (e.g., area ratio versus fitted ploidy, median level versus ratio between fitted and measured fetal fraction, principal components). The terms “raw count profile plot” or “raw profile plot” as used herein refer to a plot of counts in each portion in a region normalized to total counts in a region (e.g., genome, portion, chromosome, chromosome portions of a reference genome or a segment of a chromosome). In some embodiments, a profile can be generated using a static window process, and in certain embodiments, a profile can be generated using a sliding window process.

A profile generated for a test subject sometimes is compared to a profile generated for one or more reference subjects, to facilitate interpretation of mathematical and/or statistical manipulations of a data set and/or to provide an outcome. In some embodiments, a profile is generated based on one or more starting assumptions (e.g., maternal contribution of nucleic acid (e.g., maternal fraction), fetal contribution of nucleic acid (e.g., fetal fraction), ploidy of reference sample, the like or combinations thereof). In certain embodiments, a test profile often centers around a predetermined value representative of the absence of a genetic variation, and often deviates from the predetermined value in areas corresponding to the genomic location in which the genetic variation is located in the test subject, if the test subject possesses the genetic variation. In test subjects at risk for, or suffering from, a medical condition associated with a genetic variation, the numerical value for a selected portion is expected to vary significantly from the predetermined value for non-affected genomic locations. Depending on starting assumptions (e.g., fixed ploidy or optimized ploidy, fixed fetal fraction or optimized fetal fraction or combinations thereof) the predetermined threshold or cutoff value or threshold range of values indicative of the presence or absence of a genetic variation can vary while still providing an outcome useful for determining the presence or absence of a genetic variation. In some embodiments, a profile is indicative of and/or representative of a phenotype.

By way of a non-limiting example, normalized sample and/or reference count profiles can be obtained from raw sequence read data by (a) calculating reference median counts for selected chromosomes, portions or segments thereof from a set of references known not to carry a genetic variation; (b) removing uninformative portions from the reference sample raw counts (e.g., filtering); (c) normalizing the reference counts for all remaining portions of a reference genome to the total residual number of counts (e.g., sum of remaining counts after removal of uninformative portions of a reference genome) for the reference sample selected chromosome or selected genomic location, thereby generating a normalized reference subject profile; (d) removing the corresponding portions from the test subject sample; and (e) normalizing the remaining test subject counts for one or more selected genomic locations to the sum of the residual reference median counts for the chromosome or chromosomes containing the selected genomic locations, thereby generating a normalized test subject profile. In certain embodiments, an additional normalizing step with respect to the entire genome, reduced by the filtered portions in (b), can be included between (c) and (d).
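Steps (a) through (e) above can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name, the `min_median` parameter, and the rule that a portion is "uninformative" when its reference median count falls below `min_median` are hypothetical, and the optional genome-wide normalizing step is omitted.

```python
from statistics import median

def normalized_profiles(reference_samples, test_counts, min_median=1):
    """Return (reference_profile, test_profile) keyed by portion index.

    reference_samples: list of per-portion raw count lists, one per reference
    test_counts: per-portion raw counts for the test subject
    min_median: hypothetical filter threshold for uninformative portions
    """
    n = len(test_counts)
    # (a) reference median count for each portion
    ref_median = [median(s[i] for s in reference_samples) for i in range(n)]
    # (b) remove uninformative portions (assumed rule: low reference median)
    keep = [i for i in range(n) if ref_median[i] >= min_median]
    # (c) normalize reference counts to the total residual number of counts
    residual_total = sum(ref_median[i] for i in keep)
    ref_profile = {i: ref_median[i] / residual_total for i in keep}
    # (d) drop the same portions from the test sample, and
    # (e) normalize remaining test counts to the residual reference total
    test_profile = {i: test_counts[i] / residual_total for i in keep}
    return ref_profile, test_profile
```

With this layout a test portion that matches the reference median has the same value in both profiles, so deviations between the two profiles can be read portion by portion.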

A data set profile can be generated by one or more manipulations of counted mapped sequence read data. Some embodiments include the following. Sequence reads are mapped and the number of sequence tags mapping to each genomic portion is determined (e.g., counted). A raw count profile is generated from the mapped sequence reads that are counted. An outcome is provided by comparing a raw count profile from a test subject to a reference median count profile for chromosomes, portions or segments thereof from a set of reference subjects known not to possess a genetic variation, in certain embodiments.

In some embodiments, sequence read data is optionally filtered to remove noisy data or uninformative portions. After filtering, the remaining counts typically are summed to generate a filtered data set. A filtered count profile is generated from a filtered data set, in certain embodiments.

After sequence read data have been counted and optionally filtered, data sets can be normalized to generate levels or profiles. A data set can be normalized by normalizing one or more selected portions to a suitable normalizing reference value. In some embodiments, a normalizing reference value is representative of the total counts for the chromosome or chromosomes from which portions are selected. In certain embodiments, a normalizing reference value is representative of one or more corresponding portions, portions of chromosomes or chromosomes from a reference data set prepared from a set of reference subjects known not to possess a genetic variation. In some embodiments, a normalizing reference value is representative of one or more corresponding portions, portions of chromosomes or chromosomes from a test subject data set prepared from a test subject being analyzed for the presence or absence of a genetic variation. In certain embodiments, the normalizing process is performed utilizing a static window approach, and in some embodiments the normalizing process is performed utilizing a moving or sliding window approach. In certain embodiments, a profile comprising normalized counts is generated to facilitate classification and/or providing an outcome. An outcome can be provided based on a plot of a profile comprising normalized counts (e.g., using a plot of such a profile).

Levels

In some embodiments, a value (e.g., a number, a quantitative value) is ascribed to a level. A level can be determined by a suitable method, operation or mathematical process (e.g., a processed level). A level often is, or is derived from, counts (e.g., normalized counts) for a set of portions. In some embodiments a level of a portion is substantially equal to the total number of counts mapped to a portion (e.g., counts, normalized counts). Often a level is determined from counts that are processed, transformed or manipulated by a suitable method, operation or mathematical process known in the art. In some embodiments a level is derived from counts that are processed and non-limiting examples of processed counts include weighted, removed, filtered, normalized, adjusted, averaged, derived as a mean (e.g., mean level), added, subtracted, transformed counts or combination thereof. In some embodiments a level comprises counts that are normalized (e.g., normalized counts of portions). A level can be for counts normalized by a suitable process, non-limiting examples of which include portion-wise normalization, normalization by GC content, median count normalization, linear and nonlinear least squares regression, LOESS (e.g., GC LOESS), LOWESS, ChAI, principal component normalization, RM, GCRM, cQn, the like and/or combinations thereof. A level can comprise normalized counts or relative amounts of counts. In some embodiments a level is for counts or normalized counts of two or more portions that are averaged and the level is referred to as an average level. In some embodiments a level is for a set of portions having a mean count or mean of normalized counts which is referred to as a mean level. In some embodiments a level is derived for portions that comprise raw and/or filtered counts. In some embodiments, a level is based on counts that are raw. In some embodiments a level is associated with an uncertainty value (e.g., a standard deviation, a MAD). In some embodiments a level is represented by a Z-score or p-value.
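As a brief illustration of a mean level with an associated uncertainty value, the following Python sketch computes the mean of normalized counts over a set of portions together with the median absolute deviation (MAD). The function name is hypothetical, and pairing the mean with an unscaled MAD is only one of several reasonable choices.

```python
from statistics import mean, median

def level_with_uncertainty(normalized_counts):
    """Return (mean level, MAD) for the normalized counts of a set of portions."""
    level = mean(normalized_counts)
    center = median(normalized_counts)
    # MAD: median of absolute deviations from the median (no scaling applied)
    mad = median(abs(c - center) for c in normalized_counts)
    return level, mad
```

For example, `level_with_uncertainty([0.98, 1.02, 1.00, 1.01, 0.99])` gives a mean level of approximately 1.0 with a MAD of approximately 0.01.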

A level for one or more portions is synonymous with a “genomic section level” herein. The term “level” as used herein is sometimes synonymous with the term “elevation”. The meaning of the term “level” can be determined from the context in which it is used. For example, the term “level”, when used in the context of genomic sections, profiles, reads and/or counts often means an elevation. The term “level”, when used in the context of a substance or composition (e.g., level of RNA, plexing level) often refers to an amount. The term “level”, when used in the context of uncertainty (e.g., level of error, level of confidence, level of deviation, level of uncertainty) often refers to an amount.

Normalized or non-normalized counts for two or more levels (e.g., two or more levels in a profile) can sometimes be mathematically manipulated (e.g., added, multiplied, averaged, normalized, the like or combination thereof) according to levels. For example, normalized or non-normalized counts for two or more levels can be normalized according to one, some or all of the levels in a profile. In some embodiments normalized or non-normalized counts of all levels in a profile are normalized according to one level in the profile. In some embodiments normalized or non-normalized counts of a first level in a profile are normalized according to normalized or non-normalized counts of a second level in the profile.

Non-limiting examples of a level (e.g., a first level, a second level) are a level for a set of portions comprising processed counts, a level for a set of portions comprising a mean, median or average of counts, a level for a set of portions comprising normalized counts, the like or any combination thereof. In some embodiments, a first level and a second level in a profile are derived from counts of portions mapped to the same chromosome. In some embodiments, a first level and a second level in a profile are derived from counts of portions mapped to different chromosomes.

In some embodiments a level is determined from normalized or non-normalized counts mapped to one or more portions. In some embodiments, a level is determined from normalized or non-normalized counts mapped to two or more portions, where the normalized counts for each portion often are about the same. There can be variation in counts (e.g., normalized counts) in a set of portions for a level. In a set of portions for a level there can be one or more portions having counts that are significantly different than in other portions of the set (e.g., peaks and/or dips). Any suitable number of normalized or non-normalized counts associated with any suitable number of portions can define a level.

In some embodiments one or more levels can be determined from normalized or non-normalized counts of all or some of the portions of a genome. Often a level can be determined from all or some of the normalized or non-normalized counts of a chromosome, or segment thereof. In some embodiments, two or more counts derived from two or more portions (e.g., a set of portions) determine a level. In some embodiments two or more counts (e.g., counts from two or more portions) determine a level. In some embodiments, counts from 2 to about 100,000 portions determine a level. In some embodiments, counts from 2 to about 50,000, 2 to about 40,000, 2 to about 30,000, 2 to about 20,000, 2 to about 10,000, 2 to about 5000, 2 to about 2500, 2 to about 1250, 2 to about 1000, 2 to about 500, 2 to about 250, 2 to about 100 or 2 to about 60 portions determine a level. In some embodiments counts from about 10 to about 50 portions determine a level. In some embodiments counts from about 20 to about 40 or more portions determine a level.

In some embodiments, a level comprises counts from about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 45, 50, 55, 60 or more portions. In some embodiments, a level corresponds to a set of portions (e.g., a set of portions of a reference genome, a set of portions of a chromosome or a set of portions of a segment of a chromosome).

In some embodiments, a level is determined for normalized or non-normalized counts of portions that are contiguous. In some embodiments portions (e.g., a set of portions) that are contiguous represent neighboring segments of a genome or neighboring segments of a chromosome or gene. For example, two or more contiguous portions, when aligned by merging the portions end to end, can represent a sequence assembly of a DNA sequence longer than each portion. For example, two or more contiguous portions can represent an intact genome, chromosome, gene, intron, exon or segment thereof. In some embodiments a level is determined from a collection (e.g., a set) of contiguous portions and/or non-contiguous portions.

Outcome

Methods described herein can provide a determination of the presence or absence of a genetic variation (e.g., fetal aneuploidy) for a sample, thereby providing an outcome (e.g., thereby providing an outcome determinative of the presence or absence of a genetic variation (e.g., fetal aneuploidy)). A genetic variation often includes a gain, a loss and/or alteration (e.g., duplication, deletion, fusion, insertion, mutation, reorganization, substitution or aberrant methylation) of genetic information (e.g., chromosomes, segments of chromosomes, polymorphic regions, translocated regions, altered nucleotide sequence, the like or combinations of the foregoing) that results in a detectable change in the genome or genetic information of a test subject with respect to a reference. Presence or absence of a genetic variation can be determined by transforming, analyzing and/or manipulating sequence reads that have been mapped to portions (e.g., counts, counts of genomic portions of a reference genome). Determining an outcome, in some embodiments, comprises analyzing nucleic acid from a pregnant female. In certain embodiments, an outcome is determined according to counts (e.g., normalized counts) obtained from a pregnant female where the counts are from nucleic acid obtained from the pregnant female.

Methods described herein sometimes determine presence or absence of a fetal aneuploidy (e.g., full chromosome aneuploidy, partial chromosome aneuploidy or segmental chromosomal aberration (e.g., mosaicism, deletion and/or insertion)) for a test sample from a pregnant female bearing a fetus. In certain embodiments methods described herein detect euploidy or lack of euploidy (non-euploidy) for a sample from a pregnant female bearing a fetus. Methods described herein sometimes detect trisomy for one or more chromosomes (e.g., chromosome 13, chromosome 18, chromosome 21 or combination thereof) or segment thereof.

In some embodiments, presence or absence of a genetic variation (e.g., a fetal aneuploidy) is determined by a method described herein, by a method known in the art or by a combination thereof. Presence or absence of a genetic variation generally is determined from counts of sequence reads mapped to portions of a reference genome. Counts of sequence reads utilized to determine presence or absence of a genetic variation sometimes are raw counts and/or filtered counts, and often are normalized counts. A suitable normalization process or processes can be used to generate normalized counts, non-limiting examples of which include portion-wise normalization, normalization by GC content, linear and nonlinear least squares regression, LOESS, GC LOESS, LOWESS, ChAI, RM, GCRM and combinations thereof. Normalized counts sometimes are expressed as one or more levels, or as levels in a profile, for a particular set or sets of portions. Normalized counts sometimes are adjusted or padded prior to determining presence or absence of a genetic variation.

In some embodiments an outcome is determined according to one or more levels. In some embodiments, the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to one or more adjusted levels. In some embodiments the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to a profile comprising 1 to about 10,000 adjusted levels. Often the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to a profile comprising about 1 to about 1000, 1 to about 900, 1 to about 800, 1 to about 700, 1 to about 600, 1 to about 500, 1 to about 400, 1 to about 300, 1 to about 200, 1 to about 100, 1 to about 50, 1 to about 25, 1 to about 20, 1 to about 15, 1 to about 10, or 1 to about 5 adjustments. In some embodiments the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to a profile comprising about 1 adjustment (e.g., one adjusted level). In some embodiments an outcome is determined according to one or more profiles (e.g., a profile of a chromosome or segment thereof) comprising one or more, 2 or more, 3 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more or sometimes 10 or more adjustments. In some embodiments, the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to a profile where some levels in a profile are not adjusted. In some embodiments, the presence or absence of a genetic variation (e.g., a chromosome aneuploidy) is determined according to a profile where adjustments are not made.

In some embodiments, an adjustment of a level (e.g., a first level) in a profile reduces a false determination or false outcome. In some embodiments, an adjustment of a level (e.g., a first level) in a profile reduces the frequency and/or probability (e.g., statistical probability, likelihood) of a false determination or false outcome. A false determination or outcome can be a determination or outcome that is not accurate. A false determination or outcome can be a determination or outcome that is not reflective of the actual or true genetic make-up or the actual or true genetic disposition (e.g., the presence or absence of a genetic variation) of a subject (e.g., a pregnant female, a fetus and/or a combination thereof). In some embodiments a false determination or outcome is a false negative determination. In some embodiments a negative determination or negative outcome is the absence of a genetic variation (e.g., aneuploidy, copy number variation). In some embodiments a false determination or false outcome is a false positive determination or false positive outcome. In some embodiments a positive determination or positive outcome is the presence of a genetic variation (e.g., aneuploidy, copy number variation). In some embodiments, a determination or outcome is utilized in a diagnosis. In some embodiments, a determination or outcome is for a fetus.

Presence or absence of a genetic variation (e.g., fetal aneuploidy) sometimes is determined without comparing counts for a set of portions to a reference. Counts that are measured for a test sample and are in a test region (e.g., a set of portions of interest) are referred to as “test counts” herein. Test counts sometimes are processed counts, averaged or summed counts, a representation, normalized counts, or one or more levels as described herein. In certain embodiments test counts are averaged or summed (e.g., an average, mean, median, mode or sum is calculated) for a set of portions, and the averaged or summed counts are compared to a threshold or range. Test counts sometimes are expressed as a representation, which can be expressed as a ratio or percentage of counts for a first set of portions to counts for a second set of portions. In certain embodiments the first set of portions is for one or more test chromosomes (e.g., chromosome 13, chromosome 18, chromosome 21, or combination thereof) and sometimes the second set of portions is for the genome or a part of the genome (e.g., autosomes or autosomes and sex chromosomes). In certain embodiments a representation is compared to a threshold or range. In certain embodiments test counts are expressed as one or more levels for normalized counts over a set of portions, and the one or more levels are compared to a threshold or range. Test counts (e.g., averaged or summed counts, representation, normalized counts, one or more levels) above or below a particular threshold, in a particular range or outside a particular range sometimes are determinative of the presence of a genetic variation or lack of euploidy (e.g., not euploidy). Test counts (e.g., averaged or summed counts, representation, normalized counts, one or more levels) below or above a particular threshold, in a particular range or outside a particular range sometimes are determinative of the absence of a genetic variation or euploidy.
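A representation expressed as a ratio, and its comparison to a range, can be sketched in Python as follows. The function names, the chromosome keys and the numeric range bounds are hypothetical and chosen only for illustration; they are not values taken from this disclosure.

```python
def chromosome_representation(counts_by_chrom, test_chroms, reference_chroms):
    """Ratio of counts in the first set of portions to counts in the second set."""
    test_total = sum(counts_by_chrom[c] for c in test_chroms)
    reference_total = sum(counts_by_chrom[c] for c in reference_chroms)
    return test_total / reference_total

def outside_range(value, low, high):
    """True when a representation falls outside the range [low, high]."""
    return value < low or value > high

# Illustrative counts only; the second set here is every chromosome listed.
counts = {"chr1": 500, "chr2": 480, "chr21": 20}
rep = chromosome_representation(counts, ["chr21"], ["chr1", "chr2", "chr21"])
flagged = outside_range(rep, low=0.018, high=0.021)  # hypothetical range
```

Here `rep` is 0.02 and falls inside the hypothetical range, so `flagged` is `False`; whether an in-range or out-of-range value indicates presence or absence of a genetic variation depends on how the range is defined.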

Presence or absence of a genetic variation (e.g., fetal aneuploidy) sometimes is determined by comparing counts, non-limiting examples of which include test counts, reference counts, raw counts, filtered counts, averaged or summed counts, representations (e.g., chromosome representations), normalized counts, one or more levels (e.g., for a set of portions, e.g., genomic section levels, profiles), Z-scores, the like or combinations thereof. In some embodiments test counts are compared to a reference (e.g., reference counts). A reference (e.g., a reference count) can be a suitable determination of counts, non-limiting examples of which include raw counts, filtered counts, averaged or summed counts, representations (e.g., chromosome representations), normalized counts, one or more levels (e.g., for a set of portions, e.g., genomic section levels, profiles), Z-scores, the like or combinations thereof. Reference counts often are counts for a euploid test region or from a segment of a genome or chromosome that is euploid. In some embodiments reference counts and test counts are obtained from the same sample and/or the same subject. In some embodiments reference counts are from different samples and/or from different subjects. In some embodiments reference counts are determined from and/or compared to a corresponding segment of the genome from which the test counts are derived and/or determined. A corresponding segment refers to a segment, portion or set of portions that map to the same location of a reference genome. In some embodiments reference counts are determined from and/or compared to a different segment of the genome from which the test counts are derived and/or determined.

In certain embodiments, test counts sometimes are for a first set of portions and a reference includes counts for a second set of portions different than the first set of portions. Reference counts sometimes are for a nucleic acid sample from the same pregnant female from which the test sample is obtained. In certain embodiments reference counts are for a nucleic acid sample from one or more pregnant females different than the female from which the test sample was obtained. In some embodiments, a first set of portions is in chromosome 13, chromosome 18, chromosome 21, a segment thereof or combination of the foregoing, and the second set of portions is in another chromosome or chromosomes or segment thereof. In a non-limiting example, where a first set of portions is in chromosome 21 or segment thereof, a second set of portions often is in another chromosome (e.g., chromosome 1, chromosome 13, chromosome 14, chromosome 18, chromosome 19, segment thereof or combination of the foregoing). A reference often is located in a chromosome or segment thereof that is typically euploid. For example, chromosome 1 and chromosome 19 often are euploid in fetuses owing to a high rate of early fetal mortality associated with chromosome 1 and chromosome 19 aneuploidies. A measure of deviation between the test counts and the reference counts can be generated.

In certain embodiments a reference comprises counts for the same set of portions as for the test counts, where the counts for the reference are from one or more reference samples (e.g., often multiple reference samples from multiple reference subjects). A reference sample often is from one or more pregnant females different than a female from which a test sample is obtained. A measure of deviation (e.g., a measure of uncertainty, an uncertainty value) between the test counts and the reference counts can be generated. In some embodiments a measure of deviation is determined from the test counts. In some embodiments a measure of deviation is determined from the reference counts. In some embodiments a measure of deviation is determined from an entire profile or a subset of portions within a profile.

A suitable measure of deviation can be selected, non-limiting examples of which include standard deviation, average absolute deviation, median absolute deviation, maximum absolute deviation, standard score (e.g., z-value, z-score, normal score, standardized variable) and the like. In some embodiments, reference samples are euploid for a test region and deviation between the test counts and the reference counts is assessed. In some embodiments a determination of the presence or absence of a genetic variation is according to the number of deviations (e.g., measures of deviations, MAD) between test counts and reference counts for a segment or portion of a genome or chromosome. In some embodiments the presence of a genetic variation is determined when the number of deviations between test counts and reference counts is greater than about 1, greater than about 1.5, greater than about 2, greater than about 2.5, greater than about 2.6, greater than about 2.7, greater than about 2.8, greater than about 2.9, greater than about 3, greater than about 3.1, greater than about 3.2, greater than about 3.3, greater than about 3.4, greater than about 3.5, greater than about 4, greater than about 5, or greater than about 6. For example, sometimes a test count differs from a reference count by more than 3 measures of deviation (e.g., 3 sigma, 3 MAD) and the presence of a genetic variation is determined. In some embodiments a test count obtained from a pregnant female is larger than a reference count by more than 3 measures of deviation (e.g., 3 sigma, 3 MAD) and the presence of a fetal chromosome aneuploidy (e.g., a fetal trisomy) is determined. A deviation of greater than three between test counts and reference counts often is indicative of a non-euploid test region (e.g., presence of a genetic variation). Test counts significantly above reference counts, which reference counts are indicative of euploidy, sometimes are determinative of a trisomy. In some embodiments a test count obtained from a pregnant female is less than a reference count by more than 3 measures of deviation (e.g., 3 sigma, 3 MAD) and the presence of a fetal chromosome aneuploidy (e.g., a fetal monosomy) is determined. Test counts significantly below reference counts, which reference counts are indicative of euploidy, sometimes are determinative of a monosomy.

In some embodiments the absence of a genetic variation is determined when the number of deviations between test counts and reference counts is less than about 3.5, less than about 3.4, less than about 3.3, less than about 3.2, less than about 3.1, less than about 3.0, less than about 2.9, less than about 2.8, less than about 2.7, less than about 2.6, less than about 2.5, less than about 2.0, less than about 1.5, or less than about 1.0. For example, sometimes a test count differs from a reference count by less than 3 measures of deviation (e.g., 3 sigma, 3 MAD) and the absence of a genetic variation is determined. In some embodiments a test count obtained from a pregnant female differs from a reference count by less than 3 measures of deviation (e.g., 3 sigma, 3 MAD) and the absence of a fetal chromosome aneuploidy (e.g., a euploid fetus) is determined. A deviation of less than three between test counts and reference counts (e.g., 3-sigma for standard deviation) often is indicative of a euploid test region (e.g., absence of a genetic variation). A measure of deviation between test counts for a test sample and reference counts for one or more reference subjects can be plotted and visualized (e.g., z-score plot).
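The comparison of test counts to reference counts described above can be illustrated with a minimal Python sketch. The function names, the example counts and the default cutoff of 3 measures of deviation are assumptions chosen for illustration (3 is one of the values listed above); the sketch is not a definitive implementation of any embodiment.

```python
import statistics

def deviation_score(test_count, reference_counts, use_mad=False):
    """Return how many measures of deviation separate a test count from
    the reference counts (hypothetical helper, for illustration only).

    use_mad=False: center is the mean, spread is the sample standard deviation.
    use_mad=True:  center is the median, spread is the median absolute deviation.
    """
    if use_mad:
        center = statistics.median(reference_counts)
        spread = statistics.median([abs(c - center) for c in reference_counts])
    else:
        center = statistics.mean(reference_counts)
        spread = statistics.stdev(reference_counts)
    if spread == 0:
        raise ValueError("reference counts have zero spread")
    return (test_count - center) / spread

def call_from_deviation(score, threshold=3.0):
    """Map a deviation score to a coarse call; the threshold is an assumed cutoff."""
    if score > threshold:
        return "gain"   # test counts significantly above euploid reference counts
    if score < -threshold:
        return "loss"   # test counts significantly below euploid reference counts
    return "none"       # within the threshold: no genetic variation determined

# Illustrative (made-up) counts for one test region in five euploid reference samples.
reference = [100, 102, 98, 101, 99]
print(call_from_deviation(deviation_score(106, reference)))  # gain
print(call_from_deviation(deviation_score(101, reference)))  # none
```

Either spread measure can be used with the same cutoff logic; which measure and which cutoff are suitable depends on the embodiment.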

Any other suitable reference can be factored with test counts for determining presence or absence of a genetic variation (or determination of euploid or non-euploid) for a test region of a test sample. For example, a fetal fraction determination can be factored with test counts to determine the presence or absence of a genetic variation. A suitable process for quantifying fetal fraction can be utilized, non-limiting examples of which include a mass spectrometric process, sequencing process or combination thereof.

In some embodiments the presence or absence of a fetal chromosomal aneuploidy (e.g., a trisomy) is determined, in part, from a fetal ploidy determination. In some embodiments a fetal ploidy is determined by a suitable method described herein. In certain embodiments a fetal ploidy determination of about 1.20 or greater, about 1.25 or greater, about 1.30 or greater, about 1.35 or greater, about 1.4 or greater, or about 1.45 or greater indicates the presence of a fetal chromosome aneuploidy (e.g., the presence of a fetal trisomy). In some embodiments a fetal ploidy determination of about 1.20 to about 2.0, about 1.20 to about 1.9, about 1.20 to about 1.85, about 1.20 to about 1.8, about 1.25 to about 2.0, about 1.25 to about 1.9, about 1.25 to about 1.85, about 1.25 to about 1.8, about 1.3 to about 2.0, about 1.3 to about 1.9, about 1.3 to about 1.85, about 1.3 to about 1.8, about 1.35 to about 2.0, about 1.35 to about 1.9, about 1.35 to about 1.8, about 1.4 to about 2.0, about 1.4 to about 1.85 or about 1.4 to about 1.8 indicates the presence of a fetal chromosome aneuploidy (e.g., the presence of a fetal trisomy). In some embodiments the fetal aneuploidy is a trisomy. In some embodiments the fetal aneuploidy is a trisomy of chromosome 13, 18 and/or 21.

In some embodiments a fetal ploidy of less than about 1.35, less thanabout 1.30, less than about 1.25, less than about 1.20 or less thanabout 1.15 indicates the absence of a fetal aneuploidy (e.g., theabsence of a fetal trisomy, e.g., euploid). In some embodiments a fetalploidy determination of about 0.7 to about 1.35, about 0.7 to about1.30, about 0.7 to about 1.25, about 0.7 to about 1.20, about 0.7 toabout 1.15, about 0.75 to about 1.35, about 0.75 to about 1.30, about0.75 to about 1.25, about 0.75 to about 1.20, about 0.75 to about 1.15,about 0.8 to about 1.35, about 0.8 to about 1.30, about 0.8 to about1.25, about 0.8 to about 1.20, or about 0.8 to about 1.15 indicates theabsence of a fetal chromosome aneuploidy (e.g., the absence of a fetaltrisomy, e.g., euploid).

In some embodiments a fetal ploidy of less than about 0.8, less than about 0.75, less than about 0.70 or less than about 0.6 indicates the presence of a fetal aneuploidy (e.g., the presence of a chromosome deletion). In some embodiments a fetal ploidy determination of about 0 to about 0.8, about 0 to about 0.75, about 0 to about 0.70, about 0 to about 0.65, about 0 to about 0.60, about 0.1 to about 0.8, about 0.1 to about 0.75, about 0.1 to about 0.70, about 0.1 to about 0.65, about 0.1 to about 0.60, about 0.2 to about 0.8, about 0.2 to about 0.75, about 0.2 to about 0.70, about 0.2 to about 0.65, about 0.2 to about 0.60, about 0.25 to about 0.8, about 0.25 to about 0.75, about 0.25 to about 0.70, about 0.25 to about 0.65, about 0.25 to about 0.60, about 0.3 to about 0.8, about 0.3 to about 0.75, about 0.3 to about 0.70, about 0.3 to about 0.65, or about 0.3 to about 0.60 indicates the presence of a fetal chromosome aneuploidy (e.g., the presence of a chromosome deletion). In some embodiments the fetal aneuploidy determined is a whole chromosome deletion.

In some embodiments a determination of the presence or absence of a fetal aneuploidy (e.g., according to one or more of the ranges of a ploidy determination above) is made according to a call zone. In certain embodiments a call is made (e.g., a call determining the presence or absence of a genetic variation, e.g., an outcome) when a value (e.g., a ploidy value, a fetal fraction value, a level of uncertainty) or collection of values falls within a pre-defined range (e.g., a zone, a call zone). In some embodiments a call zone is defined according to a collection of values that are obtained from the same patient sample. In certain embodiments a call zone is defined according to a collection of values that are derived from the same chromosome or segment thereof. In some embodiments a call zone based on a ploidy determination is defined according to a level of confidence (e.g., high level of confidence, e.g., low level of uncertainty) and/or a fetal fraction. In some embodiments a call zone is defined according to a ploidy determination and a fetal fraction of about 2.0% or greater, about 2.5% or greater, about 3% or greater, about 3.25% or greater, about 3.5% or greater, about 3.75% or greater, or about 4.0% or greater. For example, in some embodiments a call is made that a fetus comprises a trisomy 21 based on a ploidy determination of greater than 1.25 with a fetal fraction determination of 2% or greater or 4% or greater for a sample obtained from a pregnant female bearing a fetus. In certain embodiments, for example, a call is made that a fetus is euploid based on a ploidy determination of less than 1.25 with a fetal fraction determination of 2% or greater or 4% or greater for a sample obtained from a pregnant female bearing a fetus. In some embodiments a call zone is defined by a confidence level of about 99% or greater, about 99.1% or greater, about 99.2% or greater, about 99.3% or greater, about 99.4% or greater, about 99.5% or greater, about 99.6% or greater, about 99.7% or greater, about 99.8% or greater or about 99.9% or greater. In some embodiments a call is made without using a call zone. In some embodiments a call is made using a call zone and additional data or information. In some embodiments a call is made based on a ploidy value without the use of a call zone. In some embodiments a call is made without calculating a ploidy value. In some embodiments a call is made based on visual inspection of a profile (e.g., visual inspection of genomic section levels). A call can be made by any suitable method based in full, or in part, upon determinations, values and/or data obtained by methods described herein, non-limiting examples of which include a fetal ploidy determination, a fetal fraction determination, maternal ploidy, uncertainty and/or confidence determinations, portion levels, levels, profiles, z-scores, expected chromosome representations, measured chromosome representations, counts (e.g., normalized counts, raw counts), fetal or maternal copy number variations (e.g., categorized copy number variations), significantly different levels, adjusted levels (e.g., padding), the like or combinations thereof.

In some embodiments a no-call zone is where a call is not made. In some embodiments a no-call zone is defined by a value or collection of values that indicate low accuracy, high risk, high error, low level of confidence, high level of uncertainty, the like or a combination thereof. In some embodiments a no-call zone is defined, in part, by a fetal fraction of about 5% or less, about 4% or less, about 3% or less, about 2.5% or less, about 2.0% or less, about 1.5% or less or about 1.0% or less.
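A call zone and a no-call zone of the kind described above can be sketched as a simple decision function. This is a minimal Python illustration only: the function name `make_call`, the returned labels and the specific cutoffs (ploidy of 1.25 and 0.8, fetal fraction of 4%) are assumptions selected from among the values listed above, and the handling of values that fall exactly on a cutoff is likewise an assumption.

```python
def make_call(ploidy, fetal_fraction,
              gain_cutoff=1.25, loss_cutoff=0.8, min_fetal_fraction=0.04):
    """Return a call for a test region from a fetal ploidy value and a
    fetal fraction (expressed as a fraction, e.g. 0.04 for 4%).

    All cutoffs are illustrative defaults, not prescribed values.
    """
    # No-call zone: fetal fraction too low for a confident determination.
    if fetal_fraction < min_fetal_fraction:
        return "no-call"
    # Call zone: ploidy above the gain cutoff (e.g., consistent with a trisomy).
    if ploidy > gain_cutoff:
        return "aneuploid (gain)"
    # Call zone: ploidy below the loss cutoff (e.g., consistent with a deletion).
    if ploidy < loss_cutoff:
        return "aneuploid (loss)"
    # Otherwise the value lies in the euploid range (boundary handling assumed).
    return "euploid"

print(make_call(1.5, 0.10))   # aneuploid (gain)
print(make_call(1.0, 0.10))   # euploid
print(make_call(1.5, 0.02))   # no-call
```

A confidence level or uncertainty value could be added as a further condition on the call zone in the same manner.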

A genetic variation sometimes is associated with a medical condition. An outcome determinative of a genetic variation is sometimes an outcome determinative of the presence or absence of a condition (e.g., a medical condition), disease, syndrome or abnormality, or includes detection of a condition, disease, syndrome or abnormality (e.g., non-limiting examples listed in Table 1). In certain embodiments a diagnosis comprises assessment of an outcome. An outcome determinative of the presence or absence of a condition (e.g., a medical condition), disease, syndrome or abnormality by methods described herein can sometimes be independently verified by further testing (e.g., by karyotyping and/or amniocentesis). Analysis and processing of data can provide one or more outcomes. The term “outcome” as used herein can refer to a result of data processing that facilitates determining the presence or absence of a genetic variation (e.g., an aneuploidy, a copy number variation). In certain embodiments the term “outcome” as used herein refers to a conclusion that predicts and/or determines the presence or absence of a genetic variation (e.g., an aneuploidy, a copy number variation). In certain embodiments the term “outcome” as used herein refers to a conclusion that predicts and/or determines a risk or probability of the presence or absence of a genetic variation (e.g., an aneuploidy, a copy number variation) in a subject (e.g., a fetus). A diagnosis sometimes comprises use of an outcome. For example, a health practitioner may analyze an outcome and provide a diagnosis based on, or based in part on, the outcome. In some embodiments, determination, detection or diagnosis of a condition, syndrome or abnormality (e.g., listed in Table 1) comprises use of an outcome determinative of the presence or absence of a genetic variation. In some embodiments, an outcome based on counted mapped sequence reads or transformations thereof is determinative of the presence or absence of a genetic variation. In certain embodiments, an outcome generated utilizing one or more methods (e.g., data processing methods) described herein is determinative of the presence or absence of one or more conditions, syndromes or abnormalities listed in Table 1. In certain embodiments a diagnosis comprises a determination of a presence or absence of a condition, syndrome or abnormality. Often a diagnosis comprises a determination of a genetic variation as the nature and/or cause of a condition, syndrome or abnormality. In certain embodiments an outcome is not a diagnosis. An outcome often comprises one or more numerical values generated using a processing method described herein in the context of one or more considerations of probability. A consideration of risk or probability can include, but is not limited to: an uncertainty value, a measure of variability, confidence level, sensitivity, specificity, standard deviation, coefficient of variation (CV), Z-scores, Chi values, Phi values, ploidy values, fitted fetal fraction, area ratios, median level, the like or combinations thereof. A consideration of probability can facilitate determining whether a subject is at risk of having, or has, a genetic variation, and an outcome determinative of a presence or absence of a genetic disorder often includes such a consideration.

An outcome sometimes is a phenotype. An outcome sometimes is a phenotypewith an associated level of confidence (e.g., an uncertainty value,e.g., a fetus is positive for trisomy 21 with a confidence level of 99%,a test subject is negative for a cancer associated with a geneticvariation at a confidence level of 95%). Different methods of generatingoutcome values sometimes can produce different types of results.Generally, there are four types of possible scores or calls that can bemade based on outcome values generated using methods described herein:true positive, false positive, true negative and false negative. Theterms “score”, “scores”, “call” and “calls” as used herein refer tocalculating the probability that a particular genetic variation ispresent or absent in a subject/sample. The value of a score may be usedto determine, for example, a variation, difference, or ratio of mappedsequence reads that may correspond to a genetic variation. For example,calculating a positive score for a selected genetic variation or portionfrom a data set, with respect to a reference genome can lead to anidentification of the presence or absence of a genetic variation, whichgenetic variation sometimes is associated with a medical condition(e.g., cancer, preeclampsia, trisomy, monosomy, and the like). In someembodiments, an outcome comprises a level, a profile and/or a plot(e.g., a profile plot). In those embodiments in which an outcomecomprises a profile, a suitable profile or combination of profiles canbe used for an outcome. Non-limiting examples of profiles that can beused for an outcome include z-score profiles, p-value profiles, chivalue profiles, phi value profiles, the like, and combinations thereof.

An outcome generated for determining the presence or absence of a genetic variation sometimes includes a null result (e.g., a data point between two clusters, a numerical value with a standard deviation that encompasses values for both the presence and absence of a genetic variation, a data set with a profile plot that is not similar to profile plots for subjects having or free from the genetic variation being investigated). In some embodiments, an outcome indicative of a null result still is a determinative result, and the determination can include the need for additional information and/or a repeat of the data generation and/or analysis for determining the presence or absence of a genetic variation.

An outcome can be generated after performing one or more processingsteps described herein, in some embodiments. In certain embodiments, anoutcome is generated as a result of one of the processing stepsdescribed herein, and in some embodiments, an outcome can be generatedafter each statistical and/or mathematical manipulation of a data set isperformed. An outcome pertaining to the determination of the presence orabsence of a genetic variation can be expressed in a suitable form,which form comprises without limitation, a probability (e.g., oddsratio, p-value), likelihood, value in or out of a cluster, value over orunder a threshold value, value within a range (e.g., a threshold range),value with a measure of variance or confidence, or risk factor,associated with the presence or absence of a genetic variation for asubject or sample. In certain embodiments, comparison between samplesallows confirmation of sample identity (e.g., allows identification ofrepeated samples and/or samples that have been mixed up (e.g.,mislabeled, combined, and the like)).

In some embodiments, an outcome comprises a value above or below a predetermined threshold or cutoff value (e.g., greater than 1, less than 1), and an uncertainty or confidence level associated with the value. In certain embodiments a predetermined threshold or cutoff value is an expected level or an expected level range. An outcome also can describe an assumption used in data processing. In certain embodiments, an outcome comprises a value that falls within or outside a predetermined range of values (e.g., a threshold range) and the associated uncertainty or confidence level for that value being inside or outside the range. In some embodiments, an outcome comprises a value that is equal to a predetermined value (e.g., equal to 1, equal to zero), or is equal to a value within a predetermined value range, and its associated uncertainty or confidence level for that value being equal or within or outside a range. An outcome sometimes is graphically represented as a plot (e.g., profile plot).

As noted above, an outcome can be characterized as a true positive, true negative, false positive or false negative. The term “true positive” as used herein refers to a subject correctly identified as having a genetic variation. The term “false positive” as used herein refers to a subject wrongly identified as having a genetic variation. The term “true negative” as used herein refers to a subject correctly identified as not having a genetic variation. The term “false negative” as used herein refers to a subject wrongly identified as not having a genetic variation. Two measures of performance for any given method can be calculated based on the ratios of these occurrences: (i) a sensitivity value, which generally is the fraction of actual positives that are correctly identified as being positive; and (ii) a specificity value, which generally is the fraction of actual negatives that are correctly identified as being negative.

In certain embodiments, one or more of sensitivity, specificity and/or confidence level are expressed as a percentage. In some embodiments, the percentage, independently for each variable, is greater than about 90% (e.g., about 90, 91, 92, 93, 94, 95, 96, 97, 98 or 99%, or greater than 99% (e.g., about 99.5% or greater, about 99.9% or greater, about 99.95% or greater, about 99.99% or greater)). Coefficient of variation (CV) in some embodiments is expressed as a percentage, and sometimes the percentage is about 10% or less (e.g., about 10, 9, 8, 7, 6, 5, 4, 3, 2 or 1%, or less than 1% (e.g., about 0.5% or less, about 0.1% or less, about 0.05% or less, about 0.01% or less)). A probability (e.g., that a particular outcome is not due to chance) in certain embodiments is expressed as a Z-score, a p-value, or the results of a t-test. In some embodiments, a measured variance, confidence interval, sensitivity, specificity and the like (e.g., referred to collectively as confidence parameters) for an outcome can be generated using one or more data processing manipulations described herein.

The term “sensitivity” as used herein refers to the number of true positives divided by the number of true positives plus the number of false negatives, where sensitivity (sens) may be within the range of 0≤sens≤1. The term “specificity” as used herein refers to the number of true negatives divided by the number of true negatives plus the number of false positives, where specificity (spec) may be within the range of 0≤spec≤1. In some embodiments a method that has sensitivity and specificity equal to one, or 100%, or near one (e.g., between about 90% and about 99%) sometimes is selected. In some embodiments, a method having a sensitivity equaling 1, or 100% is selected, and in certain embodiments, a method having a sensitivity near 1 is selected (e.g., a sensitivity of about 90%, a sensitivity of about 91%, a sensitivity of about 92%, a sensitivity of about 93%, a sensitivity of about 94%, a sensitivity of about 95%, a sensitivity of about 96%, a sensitivity of about 97%, a sensitivity of about 98%, or a sensitivity of about 99%). In some embodiments, a method having a specificity equaling 1, or 100% is selected, and in certain embodiments, a method having a specificity near 1 is selected (e.g., a specificity of about 90%, a specificity of about 91%, a specificity of about 92%, a specificity of about 93%, a specificity of about 94%, a specificity of about 95%, a specificity of about 96%, a specificity of about 97%, a specificity of about 98%, or a specificity of about 99%).
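The definitions of sensitivity and specificity given above can be expressed directly as a short Python sketch. The function names and the example tallies are hypothetical and serve only to illustrate the arithmetic.

```python
def sensitivity(true_positives, false_negatives):
    """True positives divided by (true positives + false negatives)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no subjects having the genetic variation were evaluated")
    return true_positives / total

def specificity(true_negatives, false_positives):
    """True negatives divided by (true negatives + false positives)."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no subjects lacking the genetic variation were evaluated")
    return true_negatives / total

# Made-up tallies: 99 of 100 affected subjects and 990 of 1000 unaffected
# subjects were identified correctly.
print(f"sens = {sensitivity(99, 1):.2%}")     # sens = 99.00%
print(f"spec = {specificity(990, 10):.2%}")   # spec = 99.00%
```

Both values lie between 0 and 1 and can be reported as percentages, as in the embodiments above.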

In some embodiments, presence or absence of a genetic variation (e.g., chromosome aneuploidy) is determined for a fetus. In such embodiments, presence or absence of a fetal genetic variation (e.g., fetal chromosome aneuploidy) is determined.

In certain embodiments, presence or absence of a genetic variation (e.g., chromosome aneuploidy) is determined for a sample. In such embodiments, presence or absence of a genetic variation in sample nucleic acid (e.g., chromosome aneuploidy) is determined. In some embodiments, a variation detected or not detected resides in sample nucleic acid from one source but not in sample nucleic acid from another source. Non-limiting examples of sources include placental nucleic acid, fetal nucleic acid, maternal nucleic acid, cancer cell nucleic acid, non-cancer cell nucleic acid, the like and combinations thereof. In non-limiting examples, a particular genetic variation detected or not detected (i) resides in placental nucleic acid but not in fetal nucleic acid and not in maternal nucleic acid; (ii) resides in fetal nucleic acid but not maternal nucleic acid; or (iii) resides in maternal nucleic acid but not fetal nucleic acid.

After one or more outcomes have been generated, an outcome often is usedto provide a determination of the presence or absence of a geneticvariation and/or associated medical condition. An outcome typically isprovided to a health care professional (e.g., laboratory technician ormanager; physician or assistant). Often an outcome is provided by anoutcome module. In certain embodiments an outcome is provided by aplotting module. In certain embodiments an outcome is provided on aperipheral or component of an apparatus. For example, sometimes anoutcome is provided by a printer or display. In some embodiments, anoutcome determinative of the presence or absence of a genetic variationis provided to a healthcare professional in the form of a report, and incertain embodiments the report comprises a display of an outcome valueand an associated confidence parameter. Generally, an outcome can bedisplayed in a suitable format that facilitates determination of thepresence or absence of a genetic variation and/or medical condition.Non-limiting examples of formats suitable for use for reporting and/ordisplaying data sets or reporting an outcome include digital data, agraph, a 2D graph, a 3D graph, and 4D graph, a picture, a pictograph, achart, a bar graph, a pie graph, a diagram, a flow chart, a scatterplot, a map, a histogram, a density chart, a function graph, a circuitdiagram, a block diagram, a bubble map, a constellation diagram, acontour diagram, a cartogram, spider chart, Venn diagram, nomogram, andthe like, and combination of the foregoing. Various examples of outcomerepresentations are shown in the drawings and are described in theExamples.

Generating an outcome can be viewed as a transformation of nucleic acidsequence read data, or the like, into a representation of a subject'scellular nucleic acid, in certain embodiments. For example, analyzingsequence reads of nucleic acid from a subject and generating achromosome profile and/or outcome can be viewed as a transformation ofrelatively small sequence read fragments to a representation ofrelatively large chromosome structure. In some embodiments, an outcomeresults from a transformation of sequence reads from a subject (e.g., apregnant female), into a representation of an existing structure (e.g.,a genome, a chromosome or segment thereof) present in the subject (e.g.,a maternal and/or fetal nucleic acid). In some embodiments, an outcomecomprises a transformation of sequence reads from a first subject (e.g.,a pregnant female), into a composite representation of structures (e.g.,a genome, a chromosome or segment thereof), and a second transformationof the composite representation that yields a representation of astructure present in a first subject (e.g., a pregnant female) and/or asecond subject (e.g., a fetus).

In certain embodiments an outcome can be generated according to analyzing one or more candidate segments. In some embodiments the presence or absence of a genetic variation is determined according to a discrete segment, candidate segment or composite candidate segment (e.g., the presence or absence of a discrete segment, candidate segment or composite candidate segment). In some embodiments two candidate segments derived from two decomposition renderings of the same profile are substantially the same (e.g., according to a comparison) and the presence of a chromosome aneuploidy, microduplication or microdeletion is determined. In some embodiments the presence of a composite candidate segment indicates the presence of a chromosome aneuploidy, microduplication or microdeletion. In some embodiments the presence of a whole chromosome aneuploidy is determined according to the presence of a discrete segment, candidate segment or composite candidate segment in a profile and the profile is a segment of a genome (e.g., a segment larger than a chromosome, e.g., a segment representing two or more chromosomes, a segment representing an entire genome). In some embodiments the presence of a whole chromosome aneuploidy is determined according to the presence of a discrete segment, candidate segment or composite candidate segment in a profile and the discrete segment edges are substantially the same as the edges of a chromosome. In certain embodiments the presence of a microduplication or microdeletion is determined when at least one edge of a discrete segment, candidate segment or composite candidate segment in a profile is different than an edge of a chromosome and/or the discrete segment is within a chromosome. In some embodiments the presence of a microduplication is determined and a level or AUC for a discrete segment, candidate segment or composite candidate segment is substantially larger than a reference level (e.g., a euploid region). In some embodiments the presence of a microdeletion is determined and a level or AUC for a discrete segment, candidate segment or composite candidate segment is substantially less than a reference level (e.g., a euploid region). In some embodiments candidate segments identified in two or more different decomposition renderings are not substantially the same (e.g., are different) and the absence of a chromosome aneuploidy, microduplication and/or microdeletion is determined. In some embodiments the absence of a discrete segment, candidate segment or composite candidate segment in a profile or decomposition rendering of a profile indicates the absence of a chromosome aneuploidy, microduplication or microdeletion.
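The edge-based and level-based distinctions described above can be illustrated with a minimal Python sketch. The function name `classify_segment`, the coordinate convention, the edge tolerance and the minimum level shift are all hypothetical parameters introduced for illustration; what counts as "substantially the same" edges or a "substantially" different level is an assumption here, and identification of the segment itself (e.g., from decomposition renderings) is taken as given.

```python
def classify_segment(seg_start, seg_end, chrom_length, level, reference_level,
                     edge_tolerance=1_000_000, min_level_shift=0.1):
    """Classify a candidate segment within a chromosome profile.

    seg_start, seg_end : segment edges in base pairs (0-based start, assumed)
    chrom_length       : chromosome length in base pairs
    level              : level (or AUC-derived value) for the segment
    reference_level    : level for a euploid reference region
    """
    shift = level - reference_level
    # A level not substantially different from the reference: no variation called.
    if abs(shift) < min_level_shift:
        return "no variation"
    # Segment edges substantially the same as the chromosome edges.
    spans_chromosome = (seg_start <= edge_tolerance
                        and chrom_length - seg_end <= edge_tolerance)
    if spans_chromosome:
        return "whole chromosome gain" if shift > 0 else "whole chromosome loss"
    # At least one edge differs from a chromosome edge: sub-chromosomal event.
    return "microduplication" if shift > 0 else "microdeletion"

# Illustrative values for a hypothetical 48 Mb chromosome.
print(classify_segment(0, 48_000_000, 48_000_000, 1.5, 1.0))           # whole chromosome gain
print(classify_segment(20_000_000, 23_000_000, 48_000_000, 0.5, 1.0))  # microdeletion
```

In practice the tolerances would be chosen according to portion size and the variability of the profile.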

Use of Outcomes

A health care professional, or other qualified individual, receiving a report comprising one or more outcomes determinative of the presence or absence of a genetic variation can use the displayed data in the report to make a call regarding the status of the test subject or patient. The healthcare professional can make a recommendation based on the provided outcome, in some embodiments. A health care professional or qualified individual can provide a test subject or patient with a call or score with regard to the presence or absence of the genetic variation based on the outcome value or values and associated confidence parameters provided in a report, in some embodiments. In certain embodiments, a score or call is made manually by a healthcare professional or qualified individual, using visual observation of the provided report. In certain embodiments, a score or call is made by an automated routine, sometimes embedded in software, and reviewed by a healthcare professional or qualified individual for accuracy prior to providing information to a test subject or patient. The term “receiving a report” as used herein refers to obtaining, by a communication means, a written and/or graphical representation comprising an outcome, which upon review allows a healthcare professional or other qualified individual to make a determination as to the presence or absence of a genetic variation in a test subject or patient. The report may be generated by a computer or by human data entry, and can be communicated using electronic means (e.g., over the internet, via computer, via fax, from one network location to another location at the same or different physical sites), or by another method of sending or receiving data (e.g., mail service, courier service and the like). In some embodiments the outcome is transmitted to a health care professional in a suitable medium, including, without limitation, in verbal, document, or file form. The file may be, for example, but not limited to, an auditory file, a computer readable file, a paper file, a laboratory file or a medical record file.

The term “providing an outcome” and grammatical equivalents thereof, as used herein also can refer to a method for obtaining such information, including, without limitation, obtaining the information from a laboratory (e.g., a laboratory file). A laboratory file can be generated by a laboratory that carried out one or more assays or one or more data processing steps to determine the presence or absence of the medical condition. The laboratory may be in the same location or different location (e.g., in another country) as the personnel identifying the presence or absence of the medical condition from the laboratory file. For example, the laboratory file can be generated in one location and transmitted to another location in which the information therein will be transmitted to the pregnant female subject. The laboratory file may be in tangible form or electronic form (e.g., computer readable form), in certain embodiments.

In some embodiments, an outcome can be provided to a health care professional, physician or qualified individual from a laboratory and the health care professional, physician or qualified individual can make a diagnosis based on the outcome. In some embodiments, an outcome can be provided to a health care professional, physician or qualified individual from a laboratory and the health care professional, physician or qualified individual can make a diagnosis based, in part, on the outcome along with additional data and/or information and other outcomes.

A healthcare professional or qualified individual can provide a suitable recommendation based on the outcome or outcomes provided in the report. Non-limiting examples of recommendations that can be provided based on the provided outcome report include surgery, radiation therapy, chemotherapy, genetic counseling, after birth treatment solutions (e.g., life planning, long term assisted care, medicaments, symptomatic treatments), pregnancy termination, organ transplant, blood transfusion, the like or combinations of the foregoing. In some embodiments the recommendation is dependent on the outcome-based classification provided (e.g., Down's syndrome, Turner syndrome, medical conditions associated with genetic variations in T13, medical conditions associated with genetic variations in T18). Laboratory personnel (e.g., a laboratory manager) can analyze values (e.g., test counts, reference counts, level of deviation) underlying a determination of the presence or absence of a genetic variation (or determination of euploid or non-euploid for a test region). For calls pertaining to presence or absence of a genetic variation that are close or questionable, laboratory personnel can re-order the same test, and/or order a different test (e.g., karyotyping and/or amniocentesis in the case of fetal aneuploidy determinations), that makes use of the same or different sample nucleic acid from a test subject.

Genetic Variations and Medical Conditions

The presence or absence of a genetic variation can be determined using a method or apparatus described herein. In certain embodiments, the presence or absence of one or more genetic variations is determined according to an outcome provided by methods and apparatuses described herein. A genetic variation generally is a particular genetic phenotype present in certain individuals, and often a genetic variation is present in a statistically significant sub-population of individuals. In some embodiments, a genetic variation is a chromosome abnormality (e.g., aneuploidy), partial chromosome abnormality or mosaicism, each of which is described in greater detail herein. Non-limiting examples of genetic variations include one or more deletions (e.g., micro-deletions), duplications (e.g., micro-duplications), insertions, mutations, polymorphisms (e.g., single-nucleotide polymorphisms), fusions, repeats (e.g., short tandem repeats), distinct methylation sites, distinct methylation patterns, the like and combinations thereof. An insertion, repeat, deletion, duplication, mutation or polymorphism can be of any length, and in some embodiments, is about 1 base or base pair (bp) to about 250 megabases (Mb) in length. In some embodiments, an insertion, repeat, deletion, duplication, mutation or polymorphism is about 1 base or base pair (bp) to about 1,000 kilobases (kb) in length (e.g., about 10 bp, 50 bp, 100 bp, 500 bp, 1 kb, 5 kb, 10 kb, 50 kb, 100 kb, 500 kb, or 1000 kb in length).

A genetic variation is sometimes a deletion. In certain embodiments a deletion is a mutation (e.g., a genetic aberration) in which a part of a chromosome or a sequence of DNA is missing. A deletion is often the loss of genetic material. Any number of nucleotides can be deleted. A deletion can comprise the deletion of one or more entire chromosomes, a segment of a chromosome, an allele, a gene, an intron, an exon, any non-coding region, any coding region, a segment thereof or combination thereof. A deletion can comprise a microdeletion. A deletion can comprise the deletion of a single base.

A genetic variation is sometimes a genetic duplication. In certain embodiments a duplication is a mutation (e.g., a genetic aberration) in which a part of a chromosome or a sequence of DNA is copied and inserted back into the genome. In certain embodiments a genetic duplication (i.e., duplication) is any duplication of a region of DNA. In some embodiments a duplication is a nucleic acid sequence that is repeated, often in tandem, within a genome or chromosome. In some embodiments a duplication can comprise a copy of one or more entire chromosomes, a segment of a chromosome, an allele, a gene, an intron, an exon, any non-coding region, any coding region, a segment thereof or combination thereof. A duplication can comprise a microduplication. A duplication sometimes comprises one or more copies of a duplicated nucleic acid. A duplication sometimes is characterized as a genetic region repeated one or more times (e.g., repeated 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 times). Duplications can range from small regions (thousands of base pairs) to whole chromosomes in some instances. Duplications frequently occur as the result of an error in homologous recombination or due to a retrotransposon event. Duplications have been associated with certain types of proliferative diseases. Duplications can be characterized using genomic microarrays or comparative genomic hybridization (CGH).

A genetic variation is sometimes an insertion. An insertion is sometimes the addition of one or more nucleotide base pairs into a nucleic acid sequence. An insertion is sometimes a microinsertion. In certain embodiments an insertion comprises the addition of a segment of a chromosome into a genome, chromosome, or segment thereof. In certain embodiments an insertion comprises the addition of an allele, a gene, an intron, an exon, any non-coding region, any coding region, a segment thereof or combination thereof into a genome or segment thereof. In certain embodiments an insertion comprises the addition (i.e., insertion) of nucleic acid of unknown origin into a genome, chromosome, or segment thereof. In certain embodiments an insertion comprises the addition (i.e., insertion) of a single base.

As used herein a “copy number variation” generally is a class or type of genetic variation or chromosomal aberration. A copy number variation can be a deletion (e.g., a micro-deletion), duplication (e.g., a micro-duplication) or insertion (e.g., a micro-insertion). The prefix “micro” as used herein often refers to a segment of nucleic acid less than 5 Mb in length. A copy number variation can include one or more deletions (e.g., micro-deletions), duplications (e.g., micro-duplications) and/or insertions (e.g., micro-insertions) of a segment of a chromosome. In certain embodiments a duplication comprises an insertion. In certain embodiments an insertion is a duplication. In certain embodiments an insertion is not a duplication. For example, often a duplication of a sequence in a portion increases the counts for a portion in which the duplication is found. Often a duplication of a sequence in a portion increases the level. In certain embodiments, a duplication present in portions making up a first level increases the level relative to a second level where a duplication is absent. In certain embodiments an insertion increases the counts of a portion and a sequence representing the insertion is present (i.e., duplicated) at another location within the same portion. In certain embodiments an insertion does not significantly increase the counts of a portion or level and the sequence that is inserted is not a duplication of a sequence within the same portion. In certain embodiments an insertion is not detected or represented as a duplication and a duplicate sequence representing the insertion is not present in the same portion.

In some embodiments a copy number variation is a fetal copy number variation. Often, a fetal copy number variation is a copy number variation in the genome of a fetus. In some embodiments a copy number variation is a maternal and/or fetal copy number variation. In certain embodiments a maternal and/or fetal copy number variation is a copy number variation within the genome of a pregnant female (e.g., a female subject bearing a fetus), a female subject that gave birth or a female capable of bearing a fetus. A copy number variation can be a heterozygous copy number variation where the variation (e.g., a duplication or deletion) is present on one allele of a genome. A copy number variation can be a homozygous copy number variation where the variation is present on both alleles of a genome. In some embodiments a copy number variation is a heterozygous or homozygous fetal copy number variation. In some embodiments a copy number variation is a heterozygous or homozygous maternal and/or fetal copy number variation. A copy number variation sometimes is present in a maternal genome and a fetal genome, a maternal genome and not a fetal genome, or a fetal genome and not a maternal genome.

“Ploidy” is a reference to the number of chromosomes present in a fetus or mother. In certain embodiments “ploidy” is the same as “chromosome ploidy”. In humans, for example, autosomal chromosomes are often present in pairs. For example, in the absence of a genetic variation, most humans have two of each autosomal chromosome (e.g., chromosomes 1-22). The presence of the normal complement of 2 autosomal chromosomes in a human is often referred to as euploid. “Microploidy” is similar in meaning to ploidy. “Microploidy” often refers to the ploidy of a segment of a chromosome. The term “microploidy” sometimes is a reference to the presence or absence of a copy number variation (e.g., a deletion, duplication and/or an insertion) within a chromosome (e.g., a homozygous or heterozygous deletion, duplication, or insertion, the like or absence thereof). “Ploidy” and “microploidy” sometimes are determined after normalization of counts of a level in a profile. Thus, a level representing an autosomal chromosome pair (e.g., a euploid) is often normalized to a ploidy of 1. Similarly, a level within a segment of a chromosome representing the absence of a duplication, deletion or insertion is often normalized to a microploidy of 1. Ploidy and microploidy are often portion-specific and sample-specific. Ploidy is often defined as integral multiples of ½, with the values of 1, ½, 0, 3/2, and 2 representing euploid (e.g., 2 chromosomes), 1 chromosome present (e.g., a chromosome deletion), no chromosome present, 3 chromosomes (e.g., a trisomy) and 4 chromosomes, respectively. Likewise, microploidy is often defined as integral multiples of ½, with the values of 1, ½, 0, 3/2, and 2 representing euploid (e.g., no copy number variation), a heterozygous deletion, homozygous deletion, heterozygous duplication and homozygous duplication, respectively. Some examples of ploidy values for a fetus are provided in Table 2.

In certain embodiments the microploidy of a fetus matches the microploidy of the mother of the fetus (i.e., the pregnant female subject). In certain embodiments the microploidy of a fetus matches the microploidy of the mother of the fetus and both the mother and fetus carry the same heterozygous copy number variation, the same homozygous copy number variation, or both are euploid. In certain embodiments the microploidy of a fetus is different than the microploidy of the mother of the fetus. For example, sometimes the fetus is heterozygous for a copy number variation, the mother is homozygous for the copy number variation, and the microploidy of the fetus does not match (e.g., does not equal) the microploidy of the mother for the specified copy number variation.

A microploidy is often associated with an expected level. For example, sometimes a level (e.g., a level in a profile, sometimes a level that includes substantially no copy number variation) is normalized to a value of 1 (e.g., a ploidy of 1, a microploidy of 1) and the microploidy of a homozygous duplication is 2, a heterozygous duplication is 1.5, a heterozygous deletion is 0.5 and a homozygous deletion is zero.
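The normalization described above can be illustrated with a minimal sketch. It assumes that a level scales linearly with copy number and that the normal complement is two copies. The function names `microploidy_from_copies` and `describe_microploidy` are hypothetical and introduced only for illustration.

```python
from fractions import Fraction

# Labels follow the values listed in the text: multiples of 1/2, with 1 = euploid.
MICROPLOIDY_LABELS = {
    Fraction(0): "homozygous deletion",
    Fraction(1, 2): "heterozygous deletion",
    Fraction(1): "euploid (no copy number variation)",
    Fraction(3, 2): "heterozygous duplication",
    Fraction(2): "homozygous duplication",
}


def microploidy_from_copies(copies: int, normal_copies: int = 2) -> Fraction:
    """Normalize a copy count so that the normal complement maps to 1.

    Assumption (not from the source): the level is proportional to copy number.
    """
    if copies < 0 or normal_copies <= 0:
        raise ValueError("copies must be >= 0 and normal_copies must be > 0")
    return Fraction(copies, normal_copies)


def describe_microploidy(value: Fraction) -> str:
    """Return the label the text associates with a microploidy value."""
    return MICROPLOIDY_LABELS.get(value, "outside the values listed in the text")


# Walk through 0..4 copies of a segment and report the normalized value.
for copies in range(5):
    value = microploidy_from_copies(copies)
    print(copies, float(value), describe_microploidy(value))
```

Exact fractions are used instead of floating-point numbers so that the table lookup does not depend on rounding.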

A genetic variation for which the presence or absence is identified for a subject is associated with a medical condition in certain embodiments. Thus, technology described herein can be used to identify the presence or absence of one or more genetic variations that are associated with a medical condition or medical state. Non-limiting examples of medical conditions include those associated with intellectual disability (e.g., Down Syndrome), aberrant cell-proliferation (e.g., cancer), presence of a micro-organism nucleic acid (e.g., virus, bacterium, fungus, yeast), and preeclampsia.

Non-limiting examples of genetic variations, medical conditions and states are described hereafter.

Fetal Gender

In some embodiments, a fetal gender or gender-related disorder (e.g., sex chromosome aneuploidy) can be predicted by a method or apparatus described herein. Gender determination generally is based on a sex chromosome. In humans, there are two sex chromosomes, the X and Y chromosomes. The Y chromosome contains a gene, SRY, which triggers embryonic development as a male. The Y chromosomes of humans and other mammals also contain other genes needed for normal sperm production. Individuals with XX are female and XY are male, and non-limiting variations, often referred to as sex chromosome aneuploidies, include XO, XYY, XXX and XXY. In certain embodiments, some males have two X chromosomes and one Y chromosome (XXY; Klinefelter's Syndrome), or one X chromosome and two Y chromosomes (XYY syndrome; Jacobs Syndrome), and some females have three X chromosomes (XXX; Triple X Syndrome) or a single X chromosome instead of two (XO; Turner Syndrome). In certain embodiments, only a portion of cells in an individual are affected by a sex chromosome aneuploidy, which may be referred to as a mosaicism (e.g., Turner mosaicism). Other cases include those where SRY is damaged (leading to an XY female), or copied to the X (leading to an XX male).
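The sex chromosome complements named above can be arranged as a simple lookup, sketched below. This is an illustration only: it covers just the complements listed in the text, it does not model mosaicism or the SRY exceptions, and the function names `normalize_complement` and `classify_complement` are hypothetical.

```python
# Complements and labels taken from the text; keys are written X's first, then Y's.
SEX_CHROMOSOME_STATES = {
    "XX": "female",
    "XY": "male",
    "X": "Turner Syndrome (XO)",
    "XXY": "Klinefelter's Syndrome",
    "XYY": "XYY syndrome (Jacobs Syndrome)",
    "XXX": "Triple X Syndrome",
}


def normalize_complement(complement: str) -> str:
    """Return a canonical key such as 'XXY' from input like 'xyx' or 'XO'.

    The letter 'O' marks an absent chromosome and is dropped.
    """
    letters = complement.strip().upper().replace("O", "")
    if not letters or set(letters) - {"X", "Y"}:
        raise ValueError(f"unrecognized sex chromosome complement: {complement!r}")
    return "X" * letters.count("X") + "Y" * letters.count("Y")


def classify_complement(complement: str) -> str:
    """Look up the label the text gives for a sex chromosome complement."""
    key = normalize_complement(complement)
    return SEX_CHROMOSOME_STATES.get(key, "not listed in the text")


for example in ("XX", "XY", "XO", "XXY", "XYY", "XXX"):
    print(example, "->", classify_complement(example))
```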

In certain cases, it can be beneficial to determine the gender of a fetus in utero. For example, a patient (e.g., pregnant female) with a family history of one or more sex-linked disorders may wish to determine the gender of the fetus she is carrying to help assess the risk of the fetus inheriting such a disorder. Sex-linked disorders include, without limitation, X-linked and Y-linked disorders. X-linked disorders include X-linked recessive and X-linked dominant disorders. Examples of X-linked recessive disorders include, without limitation, immune disorders (e.g., chronic granulomatous disease (CYBB), Wiskott-Aldrich syndrome, X-linked severe combined immunodeficiency, X-linked agammaglobulinemia, hyper-IgM syndrome type 1, IPEX, X-linked lymphoproliferative disease, Properdin deficiency), hematologic disorders (e.g., Hemophilia A, Hemophilia B, X-linked sideroblastic anemia), endocrine disorders (e.g., androgen insensitivity syndrome/Kennedy disease, KAL1 Kallmann syndrome, X-linked adrenal hypoplasia congenita), metabolic disorders (e.g., ornithine transcarbamylase deficiency, oculocerebrorenal syndrome, adrenoleukodystrophy, glucose-6-phosphate dehydrogenase deficiency, pyruvate dehydrogenase deficiency, Danon disease/glycogen storage disease Type IIb, Fabry's disease, Hunter syndrome, Lesch-Nyhan syndrome, Menkes disease/occipital horn syndrome), nervous system disorders (e.g., Coffin-Lowry syndrome, MASA syndrome, X-linked alpha thalassemia mental retardation syndrome, Siderius X-linked mental retardation syndrome, color blindness, ocular albinism, Norrie disease, choroideremia, Charcot-Marie-Tooth disease (CMTX2-3), Pelizaeus-Merzbacher disease, SMAX2), skin and related tissue disorders (e.g., dyskeratosis congenita, hypohidrotic ectodermal dysplasia (EDA), X-linked ichthyosis, X-linked endothelial corneal dystrophy), neuromuscular disorders (e.g., Becker's muscular dystrophy/Duchenne, centronuclear myopathy (MTM1), Conradi-Hünermann syndrome, Emery-Dreifuss muscular dystrophy 1), urologic disorders (e.g., Alport syndrome, Dent's disease, X-linked nephrogenic diabetes insipidus), bone/tooth disorders (e.g., AMELX Amelogenesis imperfecta), and other disorders (e.g., Barth syndrome, McLeod syndrome, Smith-Fineman-Myers syndrome, Simpson-Golabi-Behmel syndrome, Mohr-Tranebjærg syndrome, Nasodigitoacoustic syndrome). Examples of X-linked dominant disorders include, without limitation, X-linked hypophosphatemia, Focal dermal hypoplasia, Fragile X syndrome, Aicardi syndrome, Incontinentia pigmenti, Rett syndrome, CHILD syndrome, Lujan-Fryns syndrome, and Orofaciodigital syndrome 1. Examples of Y-linked disorders include, without limitation, male infertility, retinitis pigmentosa, and azoospermia.

Chromosome Abnormalities

In some embodiments, the presence or absence of a fetal chromosome abnormality can be determined by using a method or apparatus described herein. Chromosome abnormalities include, without limitation, a gain or loss of an entire chromosome or a region of a chromosome comprising one or more genes. Chromosome abnormalities include monosomies, trisomies, polysomies, loss of heterozygosity, translocations, deletions and/or duplications of one or more nucleotide sequences (e.g., one or more genes), including deletions and duplications caused by unbalanced translocations. The term “chromosomal abnormality” or “aneuploidy” as used herein refers to a deviation between the structure of the subject chromosome and a normal homologous chromosome. The term “normal” refers to the predominant karyotype or banding pattern found in healthy individuals of a particular species, for example, a euploid genome (in humans, 46,XX or 46,XY). As different organisms have widely varying chromosome complements, the term “aneuploidy” does not refer to a particular number of chromosomes, but rather to the situation in which the chromosome content within a given cell or cells of an organism is abnormal. In some embodiments, the term “aneuploidy” herein refers to an imbalance of genetic material caused by a loss or gain of a whole chromosome, or part of a chromosome. An “aneuploidy” can refer to one or more deletions and/or insertions of a segment of a chromosome. The term “euploid”, in some embodiments, refers to a normal complement of chromosomes.

The term “monosomy” as used herein refers to lack of one chromosome of the normal complement. Partial monosomy can occur in unbalanced translocations or deletions, in which only a segment of the chromosome is present in a single copy. Monosomy of sex chromosomes (45, X) causes Turner syndrome, for example. The term “disomy” refers to the presence of two copies of a chromosome. For organisms such as humans that have two copies of each chromosome (those that are diploid or “euploid”), disomy is the normal condition. For organisms that normally have three or more copies of each chromosome (those that are triploid or above), disomy is an aneuploid chromosome state.

In uniparental disomy, both copies of a chromosome come from the same parent (with no contribution from the other parent).

The term “trisomy” as used herein refers to the presence of three copies, instead of two copies, of a particular chromosome. The presence of an extra chromosome 21, which is found in human Down syndrome, is referred to as “Trisomy 21.” Trisomy 18 and Trisomy 13 are two other human autosomal trisomies. Trisomy of sex chromosomes can be seen in females (e.g., 47, XXX in Triple X Syndrome) or males (e.g., 47, XXY in Klinefelter's Syndrome; or 47, XYY in Jacobs Syndrome). In some embodiments, a trisomy is a duplication of most or all of an autosome. In certain embodiments a trisomy is a whole chromosome aneuploidy resulting in three instances (e.g., three copies) of a particular type of chromosome (e.g., instead of two instances (i.e., a pair) of a particular type of chromosome for a euploid).

The terms “tetrasomy” and “pentasomy” as used herein refer to the presence of four or five copies of a chromosome, respectively. Although rarely seen with autosomes, sex chromosome tetrasomy and pentasomy have been reported in humans, including XXXX, XXXY, XXYY, XYYY, XXXXX, XXXXY, XXXYY, XXYYY and XYYYY.
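The terms defined above map a chromosome copy count to a name, and whether that count is aneuploid depends on the normal complement of the organism. The following minimal sketch restates those definitions; the function names `somy_name` and `is_aneuploid_state` are hypothetical, and copy counts the text does not name are reported as such rather than given a term.

```python
# Names for chromosome copy counts, as defined in the text.
SOMY_NAMES = {
    1: "monosomy",
    2: "disomy",
    3: "trisomy",
    4: "tetrasomy",
    5: "pentasomy",
}


def somy_name(copies: int) -> str:
    """Return the term the text uses for a given number of chromosome copies."""
    if copies < 0:
        raise ValueError("copies must be non-negative")
    return SOMY_NAMES.get(copies, "not named in the text")


def is_aneuploid_state(copies: int, normal_copies: int = 2) -> bool:
    """A copy count is aneuploid when it differs from the normal complement.

    With normal_copies=2 (e.g., humans) disomy is normal; with normal_copies=3
    (a triploid organism) disomy is an aneuploid state, as the text notes.
    """
    return copies != normal_copies


for copies in range(1, 6):
    print(copies, somy_name(copies), "aneuploid in a diploid:", is_aneuploid_state(copies))
```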

Chromosome abnormalities can be caused by a variety of mechanisms. Mechanisms include, but are not limited to, (i) nondisjunction occurring as the result of a weakened mitotic checkpoint, (ii) inactive mitotic checkpoints causing nondisjunction at multiple chromosomes, (iii) merotelic attachment occurring when one kinetochore is attached to both mitotic spindle poles, (iv) a multipolar spindle forming when more than two spindle poles form, (v) a monopolar spindle forming when only a single spindle pole forms, and (vi) a tetraploid intermediate occurring as an end result of the monopolar spindle mechanism.

The terms “partial monosomy” and “partial trisomy” as used herein refer to an imbalance of genetic material caused by loss or gain of part of a chromosome. A partial monosomy or partial trisomy can result from an unbalanced translocation, where an individual carries a derivative chromosome formed through the breakage and fusion of two different chromosomes. In this situation, the individual would have three copies of part of one chromosome (two normal copies and the segment that exists on the derivative chromosome) and only one copy of part of the other chromosome involved in the derivative chromosome.

The term “mosaicism” as used herein refers to aneuploidy in some cells, but not all cells, of an organism. Certain chromosome abnormalities can exist as mosaic and non-mosaic chromosome abnormalities. For example, certain trisomy 21 individuals have mosaic Down syndrome and some have non-mosaic Down syndrome. Different mechanisms can lead to mosaicism. For example, (i) an initial zygote may have three 21st chromosomes, which normally would result in simple trisomy 21, but during the course of cell division one or more cell lines lost one of the 21st chromosomes; and (ii) an initial zygote may have two 21st chromosomes, but during the course of cell division one of the 21st chromosomes was duplicated. Somatic mosaicism likely occurs through mechanisms distinct from those typically associated with genetic syndromes involving complete or mosaic aneuploidy. Somatic mosaicism has been identified in certain types of cancers and in neurons, for example. In certain instances, trisomy 12 has been identified in chronic lymphocytic leukemia (CLL) and trisomy 8 has been identified in acute myeloid leukemia (AML). Also, genetic syndromes in which an individual is predisposed to breakage of chromosomes (chromosome instability syndromes) are frequently associated with increased risk for various types of cancer, thus highlighting the role of somatic aneuploidy in carcinogenesis. Methods and protocols described herein can identify presence or absence of non-mosaic and mosaic chromosome abnormalities.

Tables 1A and 1B present a non-limiting list of chromosome conditions, syndromes and/or abnormalities that can be potentially identified by methods and apparatus described herein. Table 1B is from the DECIPHER database as of Oct. 6, 2011 (e.g., version 5.1, based on positions mapped to GRCh37; available at uniform resource locator (URL) dechipher.sanger.ac.uk).

TABLE 1A Chromo- some Abnormality Disease Association X XO Turner'sSyndrome Y XXY Klinefelter syndrome Y XYY Double Y syndrome Y XXXTrisomy X syndrome Y XXXX Four X syndrome Y Xp21 deletionDuchenne's/Becker syndrome, congenital adrenal hypoplasia, chronicgranulomatus disease Y Xp22 deletion steroid sulfatase deficiency Y Xq26deletion X-linked lymphoproliferative disease 1 1p (somatic)neuroblastoma monosomy trisomy 2 monosomy growth retardation,developmental and trisomy 2q mental delay, and minor physicalabnormalities 3 monosomy Non-Hodgkin's lymphoma trisomy (somatic) 4monosomy Acute non lymphocytic leukemia trisomy (somatic) (ANLL) 5 5p,5p minus Cri du chat; Lejeune syndrome 5 5q myelodysplastic syndrome(somatic) monosomy trisomy 6 monosomy clear-cell sarcoma trisomy(somatic) 7 7q11.23 deletion William's syndrome 7 monosomy monosomy 7syndrome of childhood; trisomy somatic: renal cortical adenomas;myelodysplastic syndrome 8 8q24.1 deletion Langer-Giedon syndrome 8monosomy myelodysplastic syndrome; Warkany trisomy syndrome; somatic:chronic myelogenous leukemia 9 monosomy 9p Alfi's syndrome 9 monosomy 9pRethore syndrome partial trisomy 9 trisomy complete trisomy 9 syndrome;mosaic trisomy 9 syndrome 10 Monosomy ALL or ANLL trisomy (somatic) 1111p- Aniridia; Wilms tumor 11 11q- Jacobsen Syndrome 11 monosomy myeloidlineages affected (ANLL, MDS) (somatic) trisomy 12 monosomy CLL,Juvenile granulosa cell tumor trisomy (somatic) (JGCT) 13 13q-13q-syndrome; Orbeli syndrome 13 13q14 deletion retinoblastoma 13monosomy Patau's syndrome trisomy 14 monosomy myeloid disorders (MDS,ANLL, trisomy (somatic) atypical CML) 15 15q11-q13 Prader-Willi,Angelman's syndrome deletion monosomy; 15q deletion 15 trisomy (somatic)myeloid and lymphoid lineages affected, e.g., MDS, ANLL, ALL, CLL) 16trisomy Full Trisomy 16 Mosaic Trisomy 16 16 16q13.3 deletionRubenstein-Taybi 3 monosomy papillary renal cell carcinomas trisomy(somatic) (malignant) 17 17p-(somatic) 17p syndrome in myeloidmalignancies 
17 17q11.2 deletion Smith-Magenis 17 17q13.3 Miller-Dieker17 monosomy renal cortical adenomas trisomy (somatic) 17 17p11.2-12Charcot-Marie Tooth Syndrome type 1; trisomy HNPP 18 18p- 18p partialmonosomy syndrome or Grouchy Lamy Thieffry syndrome 18 18q- Grouchy LamySalmon Landry Syndrome 18 monosomy Edwards Syndrome trisomy 19 monosomytrisomy 20 20p- trisomy 20p syndrome 20 20p11.2-12 Alagille deletion 2020q- somatic: MDS, ANLL, polycythemia vera, chronic neutrophilicleukemia 20 monosomy papillary renal cell carcinomas trisomy (somatic)(malignant) 21 monosomy Down's syndrome trisomy 22 22q11.2 deletionDiGeorge's syndrome, velocardiofacial syndrome, conotruncal anomaly facesyndrome, autosomal dominant Opitz G/BBB syndrome, Caylor cardiofacialsyndrome 22 monosomy complete trisomy 22 syndrome trisomy

TABLE 1B Chro- Inter- mo- val Syndrome some Start End (Mb) Grade 12q1412 65,071,919 68,645,525 3.57 microdeletion syndrome 15q13.3 1530,769,995 32,701,482 1.93 microdeletion syndrome 15q24 recurrent 1574,377,174 76,162,277 1.79 microdeletion syndrome 15q26 overgrowth 1599,357,970 102,521,392 3.16 syndrome 16p11.2 16 29,501,198 30,202,5720.70 microduplication syndrome 16p11.2-p12.2 16 21,613,956 29,042,1927.43 microdeletion syndrome 16p13.11 16 15,504,454 16,284,248 0.78recurrent microdeletion (neurocognitive disorder susceptibility locus)16p13.11 16 15,504,454 16,284,248 0.78 recurrent microduplication(neurocognitive disorder susceptibility locus) 17q21.3 recurrent 1743,632,466 44,210,205 0.58 1 microdeletion syndrome 1p36 1 10,0015,408,761 5.40 1 microdeletion syndrome 1q21.1 recurrent 1 146,512,930147,737,500 1.22 3 microdeletion (susceptibility locus for neurodevelop-mental disorders) 1q21.1 recurrent 1 146,512,930 147,737,500 1.22 3microduplication (possible susceptibility locus for neurodevelop- mentaldisorders) 1q21.1 1 145,401,253 145,928,123 0.53 3 susceptibility locusfor Thrombo- cytopenia- Absent Radius (TAR) syndrome 22q11 deletion 2218,546,349 22,336,469 3.79 1 syndrome (Velocardiofacial/ DiGeorgesyndrome) 22q11 duplication 22 18,546,349 22,336,469 3.79 3 syndrome22q11.2 distal 22 22,115,848 23,696,229 1.58 deletion syndrome 22q13deletion 22 51,045,516 51,187,844 0.14 1 syndrome (Phelan- Mcdermidsyndrome) 2p15-16.1 2 57,741,796 61,738,334 4.00 microdeletion syndrome2q33.1 deletion 2 196,925,089 205,206,940 8.28 1 syndrome 2q37 monosomy2 239,954,693 243,102,476 3.15 1 3q29 3 195,672,229 197,497,869 1.83microdeletion syndrome 3q29 3 195,672,229 197,497,869 1.83microduplication syndrome 7q11.23 7 72,332,743 74,616,901 2.28duplication syndrome 8p23.1 deletion 8 8,119,295 11,765,719 3.65syndrome 9q subtelomeric 9 140,403,363 141,153,431 0.75 1 deletionsyndrome Adult-onset 5 126,063,045 126,204,952 0.14 autosomal dominantleukodystrophy (ADLD) Angelman 15 
22,876,632 28,557,186 5.68 1 syndrome(Type 1) Angelman 15 23,758,390 28,557,186 4.80 1 syndrome (Type 2)ATR-16 16 60,001 834,372 0.77 1 syndrome AZFa Y 14,352,761 15,154,8620.80 AZFb Y 20,118,045 26,065,197 5.95 AZFb + AZFc Y 19,964,82627,793,830 7.83 AZFc Y 24,977,425 28,033,929 3.06 Cat-Eye 22 116,971,860 16.97  Syndrome (Type I) Charcot-Marie- 17 13,968,60715,434,038 1.47 1 Tooth syndrome type 1A (CMT1A) Cri du Chat 5 10,00111,723,854 11.71 1 Syndrome (5p deletion) Early-onset 21 27,037,95627,548,479 0.51 Alzheimer disease with cerebral amyloid angiopathyFamilial 5 112,101,596 112,221,377 0.12 Adenomatous Polyposis Hereditary17 13,968,607 15,434,038 1.47 1 Liability to Pressure Palsies (HNPP)Leri-Weill X 751,878 867,875 0.12 dyschondrostosis (LWD) - SHOX deletionLeri-Weill X 460,558 753,877 0.29 dyschondrostosis (LWD) - SHOX deletionMiller-Dieker 17 1 2,545,429 2.55 1 syndrome (MDS) NF1- 17 29,162,82230,218,667 1.06 1 microdeletion syndrome Pelizaeus- X 102,642,051103,131,767 0.49 Merzbacher disease Potocki-Lupski 17 16,706,02120,482,061 3.78 syndrome (17p11.2 duplication syndrome) Potocki-Shaffer11 43,985,277 46,064,560 2.08 1 syndrome Prader-Willi 15 22,876,63228,557,186 5.68 1 syndrome (Type 1) Prader-Willi 15 23,758,39028,557,186 4.80 1 Syndrome (Type 2) RCAD 17 34,907,366 36,076,803 1.17(renal cysts and diabetes) Rubinstein-Taybi 16 3,781,464 3,861,246 0.081 Syndrome Smith-Magenis 17 16,706,021 20,482,061 3.78 1 Syndrome Sotossyndrome 5 175,130,402 177,456,545 2.33 1 Split hand/foot 7 95,533,86096,779,486 1.25 malformation 1 (SHFM1) Steroid sulphatase X 6,441,9578,167,697 1.73 deficiency (STS) WAGR 11p13 11 31,803,509 32,510,988 0.71deletion syndrome Williams-Beuren 7 72,332,743 74,616,901 2.28 1Syndrome (WBS) Wolf-Hirschhorn 4 10,001 2,073,670 2.06 1 Syndrome Xq28(MECP2) X 152,749,900 153,390,999 0.64 duplication

Grade 1 conditions often have one or more of the following characteristics: pathogenic anomaly; strong agreement amongst geneticists; highly penetrant; may still have variable phenotype but some common features; all cases in the literature have a clinical phenotype; no cases of healthy individuals with the anomaly; not reported on DVG databases or found in healthy population; functional data confirming single gene or multi-gene dosage effect; confirmed or strong candidate genes; clinical management implications defined; known cancer risk with implication for surveillance; multiple sources of information (OMIM, Genereviews, Orphanet, Unique, Wikipedia); and/or available for diagnostic use (reproductive counseling).

Grade 2 conditions often have one or more of the following characteristics: likely pathogenic anomaly; highly penetrant; variable phenotype with no consistent features other than DD; small number of cases/reports in the literature; all reported cases have a clinical phenotype; no functional data or confirmed pathogenic genes; multiple sources of information (OMIM, Genereviews, Orphanet, Unique, Wikipedia); and/or may be used for diagnostic purposes and reproductive counseling.

Grade 3 conditions often have one or more of the following characteristics: susceptibility locus; healthy individuals or unaffected parents of a proband described; present in control populations; non-penetrant; phenotype mild and not specific; features less consistent; no functional data or confirmed pathogenic genes; more limited sources of data; a second diagnosis remains a possibility for cases deviating from the majority or if a novel clinical finding is present; and/or caution when using for diagnostic purposes and guarded advice for reproductive counseling.

Preeclampsia

In some embodiments, the presence or absence of preeclampsia is determined by using a method or apparatus described herein. Preeclampsia is a condition in which hypertension arises in pregnancy (i.e., pregnancy-induced hypertension) and is associated with significant amounts of protein in the urine. In certain embodiments, preeclampsia also is associated with elevated levels of extracellular nucleic acid and/or alterations in methylation patterns. For example, a positive correlation between extracellular fetal-derived hypermethylated RASSF1A levels and the severity of preeclampsia has been observed. In certain examples, increased DNA methylation is observed for the H19 gene in preeclamptic placentas compared to normal controls.

Preeclampsia is one of the leading causes of maternal and fetal/neonatal mortality and morbidity worldwide. Circulating cell-free nucleic acids in plasma and serum are novel biomarkers with promising clinical applications in different medical fields, including prenatal diagnosis. Quantitative changes of cell-free fetal (cff) DNA in maternal plasma as an indicator for impending preeclampsia have been reported in different studies, for example, using real-time quantitative PCR for the male-specific SRY or DYS14 loci. In cases of early-onset preeclampsia, elevated levels may be seen in the first trimester. The increased levels of cffDNA before the onset of symptoms may be due to hypoxia/reoxygenation within the intervillous space leading to tissue oxidative stress and increased placental apoptosis and necrosis. In addition to the evidence for increased shedding of cffDNA into the maternal circulation, there is also evidence for reduced renal clearance of cffDNA in preeclampsia. As the amount of fetal DNA is currently determined by quantifying Y-chromosome specific sequences, approaches such as measurement of total cell-free DNA or the use of gender-independent fetal epigenetic markers, such as DNA methylation, offer an alternative. Cell-free RNA of placental origin is another alternative biomarker that may be used for screening and diagnosing preeclampsia in clinical practice. Fetal RNA is associated with subcellular placental particles that protect it from degradation. Fetal RNA levels sometimes are ten-fold higher in pregnant females with preeclampsia compared to controls, and fetal RNA therefore is an alternative biomarker that may be used for screening and diagnosing preeclampsia in clinical practice.

Pathogens

In some embodiments, the presence or absence of a pathogenic condition is determined by a method or apparatus described herein. A pathogenic condition can be caused by infection of a host by a pathogen including, but not limited to, a bacterium, virus or fungus. Since pathogens typically possess nucleic acid (e.g., genomic DNA, genomic RNA, mRNA) that can be distinguishable from host nucleic acid, methods and apparatus provided herein can be used to determine the presence or absence of a pathogen. Often, pathogens possess nucleic acid with characteristics unique to a particular pathogen such as, for example, epigenetic state and/or one or more sequence variations, duplications and/or deletions. Thus, methods provided herein may be used to identify a particular pathogen or pathogen variant (e.g., strain).

Cancers

In some embodiments, the presence or absence of a cell proliferation disorder (e.g., a cancer) is determined by using a method or apparatus described herein. For example, levels of cell-free nucleic acid in serum can be elevated in patients with various types of cancer compared with healthy patients. Patients with metastatic diseases, for example, can sometimes have serum DNA levels approximately twice as high as non-metastatic patients. Patients with metastatic diseases may also be identified by cancer-specific markers and/or certain single nucleotide polymorphisms or short tandem repeats, for example. Non-limiting examples of cancer types that may be positively correlated with elevated levels of circulating DNA include breast cancer, colorectal cancer, gastrointestinal cancer, hepatocellular cancer, lung cancer, melanoma, non-Hodgkin lymphoma, leukemia, multiple myeloma, bladder cancer, hepatoma, cervical cancer, esophageal cancer, pancreatic cancer, and prostate cancer. Various cancers can possess, and can sometimes release into the bloodstream, nucleic acids with characteristics that are distinguishable from nucleic acids from non-cancerous healthy cells, such as, for example, epigenetic state and/or sequence variations, duplications and/or deletions. Such characteristics can, for example, be specific to a particular type of cancer. Thus, it is further contemplated that a method provided herein can be used to identify a particular type of cancer.

Software can be used to perform one or more steps in the processes described herein, including but not limited to counting, data processing, generating an outcome, and/or providing one or more recommendations based on generated outcomes, as described in greater detail hereafter.

Machines, Software and Interfaces

Certain processes and methods described herein (e.g., quantifying, partitioning, mapping, normalizing, range setting, adjusting, categorizing, counting and/or determining sequence reads, counts, levels and/or profiles) often cannot be performed without a computer, processor, software, module or other apparatus. Methods described herein typically are computer-implemented methods, and one or more portions of a method sometimes are performed by one or more processors (e.g., microprocessors), computers, or microprocessor controlled apparatuses. Embodiments pertaining to methods described in this document generally are applicable to the same or related processes implemented by instructions in systems, apparatus and computer program products described herein. In some embodiments, processes and methods described herein (e.g., quantifying, partitioning, counting and/or determining sequence reads, counts, levels and/or profiles) are performed by automated methods. In some embodiments one or more steps of a method described herein are carried out by a processor and/or computer, and/or carried out in conjunction with memory. In some embodiments, an automated method is embodied in software, modules, processors, peripherals and/or an apparatus comprising the like, that determine sequence reads, partitioning, counts, mapping, mapped sequence tags, levels, profiles, normalizations, comparisons, range setting, categorization, adjustments, plotting, outcomes, transformations and identifications. As used herein, software refers to computer readable program instructions that, when executed by a processor, perform computer operations, as described herein.

Sequence reads, counts, levels, and profiles derived from a test subject (e.g., a patient, a pregnant female) and/or from a reference subject can be further analyzed and processed to determine the presence or absence of a genetic variation. Sequence reads, counts, levels and/or profiles sometimes are referred to as “data” or “data sets”. In some embodiments, data or data sets can be characterized by one or more features or variables (e.g., sequence based [e.g., GC content, specific nucleotide sequence, the like], function specific [e.g., expressed genes, cancer genes, the like], location based [e.g., genome specific, chromosome specific, portion specific], the like and combinations thereof). In certain embodiments, data or data sets can be organized into a matrix having two or more dimensions based on one or more features or variables. Data organized into matrices can be organized using any suitable features or variables. A non-limiting example of data in a matrix includes data that is organized by maternal age, maternal ploidy, and fetal contribution. In certain embodiments, data sets characterized by one or more features or variables sometimes are processed after counting.

Apparatuses, software and interfaces may be used to conduct methods described herein. Using apparatuses, software and interfaces, a user may enter, request, query or determine options for using particular information, programs or processes (e.g., mapping sequence reads, processing mapped data and/or providing an outcome), which can involve implementing statistical analysis algorithms, statistical significance algorithms, statistical algorithms, iterative steps, validation algorithms, and graphical representations, for example. In some embodiments, a data set may be entered by a user as input information, a user may download one or more data sets by a suitable hardware media (e.g., flash drive), and/or a user may send a data set from one system to another for subsequent processing and/or providing an outcome (e.g., send sequence read data from a sequencer to a computer system for sequence read mapping; send mapped sequence data to a computer system for processing and yielding an outcome and/or report).

A system typically comprises one or more apparatus. Each apparatus comprises one or more of memory, one or more processors, and instructions. Where a system includes two or more apparatus, some or all of the apparatus may be located at the same location, some or all of the apparatus may be located at different locations, all of the apparatus may be located at one location and/or all of the apparatus may be located at different locations. Where a system includes two or more apparatus, some or all of the apparatus may be located at the same location as a user, some or all of the apparatus may be located at a location different than a user, all of the apparatus may be located at the same location as the user, and/or all of the apparatus may be located at one or more locations different than the user.

A system sometimes comprises a computing apparatus and a sequencing apparatus, where the sequencing apparatus is configured to receive physical nucleic acid and generate sequence reads, and the computing apparatus is configured to process the reads from the sequencing apparatus. The computing apparatus sometimes is configured to determine the presence or absence of a genetic variation (e.g., copy number variation; fetal chromosome aneuploidy) from the sequence reads.

A user may, for example, place a query to software which then may acquire a data set via internet access, and in certain embodiments, a programmable processor may be prompted to acquire a suitable data set based on given parameters. A programmable processor also may prompt a user to select one or more data set options selected by the processor based on given parameters. A programmable processor may prompt a user to select one or more data set options selected by the processor based on information found via the internet, other internal or external information, or the like. Options may be chosen for selecting one or more data feature selections, one or more statistical algorithms, one or more statistical analysis algorithms, one or more statistical significance algorithms, iterative steps, one or more validation algorithms, and one or more graphical representations of methods, apparatuses, or computer programs.

Systems addressed herein may comprise general components of computer systems, such as, for example, network servers, laptop systems, desktop systems, handheld systems, personal digital assistants, computing kiosks, and the like. A computer system may comprise one or more input means such as a keyboard, touch screen, mouse, voice recognition or other means to allow the user to enter data into the system. A system may further comprise one or more outputs, including, but not limited to, a display screen (e.g., CRT or LCD), speaker, FAX machine, printer (e.g., laser, ink jet, impact, black and white or color printer), or other output useful for providing visual, auditory and/or hardcopy output of information (e.g., outcome and/or report).

In a system, input and output means may be connected to a central processing unit which may comprise, among other components, a microprocessor for executing program instructions and memory for storing program code and data. In some embodiments, processes may be implemented as a single user system located in a single geographical site. In certain embodiments, processes may be implemented as a multi-user system. In the case of a multi-user implementation, multiple central processing units may be connected by means of a network. The network may be local, encompassing a single department in one portion of a building or an entire building, or it may span multiple buildings, a region, an entire country or the world. The network may be private, being owned and controlled by a provider, or it may be implemented as an internet based service where the user accesses a web page to enter and retrieve information. Accordingly, in certain embodiments, a system includes one or more machines, which may be local or remote with respect to a user. More than one machine in one location or multiple locations may be accessed by a user, and data may be mapped and/or processed in series and/or in parallel. Thus, a suitable configuration and control may be utilized for mapping and/or processing data using multiple machines, such as in local network, remote network and/or “cloud” computing platforms.

A system can include a communications interface in some embodiments. A communications interface allows for transfer of software and data between a computer system and one or more external devices. Non-limiting examples of communications interfaces include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, and the like. Software and data transferred via a communications interface generally are in the form of signals, which can be electronic, electromagnetic, optical and/or other signals capable of being received by a communications interface. Signals often are provided to a communications interface via a channel. A channel often carries signals and can be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and/or other communications channels. Thus, in an example, a communications interface may be used to receive signal information that can be detected by a signal detection module.

Data may be input by a suitable device and/or method, including, but not limited to, manual input devices or direct data entry devices (DDEs). Non-limiting examples of manual devices include keyboards, concept keyboards, touch sensitive screens, light pens, mouse, tracker balls, joysticks, graphic tablets, scanners, digital cameras, video digitizers and voice recognition devices. Non-limiting examples of DDEs include bar code readers, magnetic strip codes, smart cards, magnetic ink character recognition, optical character recognition, optical mark recognition, and turnaround documents.

In some embodiments, output from a sequencing apparatus may serve as data that can be input via an input device. In certain embodiments, mapped sequence reads may serve as data that can be input via an input device. In certain embodiments, simulated data is generated by an in silico process and the simulated data serves as data that can be input via an input device. The term “in silico” refers to research and experiments performed using a computer. In silico processes include, but are not limited to, mapping sequence reads and processing mapped sequence reads according to processes described herein.

A system may include software useful for performing a process described herein, and software can include one or more modules for performing such processes (e.g., sequencing module, logic processing module, data display organization module). The term “software” refers to computer readable program instructions that, when executed by a computer, perform computer operations. Instructions executable by the one or more processors sometimes are provided as executable code, that when executed, can cause one or more processors to implement a method described herein. A module described herein can exist as software, and instructions (e.g., processes, routines, subroutines) embodied in the software can be implemented or performed by a processor. For example, a module (e.g., a software module) can be a part of a program that performs a particular process or task. The term “module” refers to a self-contained functional unit that can be used in a larger apparatus or software system. A module can comprise a set of instructions for carrying out a function of the module. A module can transform data and/or information. Data and/or information can be in a suitable form. For example, data and/or information can be digital or analogue. In certain embodiments, data and/or information can be packets, bytes, characters, or bits. In some embodiments, data and/or information can be any gathered, assembled or usable data or information. Non-limiting examples of data and/or information include a suitable media, pictures, video, sound (e.g., frequencies, audible or non-audible), numbers, constants, a value, objects, time, functions, instructions, maps, references, sequences, reads, mapped reads, levels, ranges, thresholds, signals, displays, representations, or transformations thereof. A module can accept or receive data and/or information, transform the data and/or information into a second form, and provide or transfer the second form to an apparatus, peripheral, component or another module.
A module can perform one or more of the following non-limiting functions: partitioning a reference genome, or part thereof, mapping sequence reads, providing counts, assembling portions, providing or determining a level, providing a count profile, normalizing (e.g., normalizing reads, normalizing counts, and the like), providing a normalized count profile or levels of normalized counts, comparing two or more levels, providing uncertainty values, providing or determining expected levels and expected ranges (e.g., expected level ranges, threshold ranges and threshold levels), providing adjustments to levels (e.g., adjusting a first level, adjusting a second level, adjusting a profile of a chromosome or a segment thereof, and/or padding), providing identification (e.g., identifying a copy number variation, genetic variation or aneuploidy), categorizing, plotting, and/or determining an outcome, for example. A processor can, in certain embodiments, carry out the instructions in a module. In some embodiments, one or more processors are required to carry out instructions in a module or group of modules. A module can provide data and/or information to another module, apparatus or source and can receive data and/or information from another module, apparatus or source.
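
By way of a non-limiting illustration, the idea of a module that accepts data, transforms it into a second form, and transfers the result to another module can be sketched in Python as follows. The `Module` class, the `run_modules` helper, the two example transforms and the toy input are hypothetical names and values chosen for this sketch only; they are not taken from the embodiments described herein.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Module:
    """Hypothetical self-contained unit: receives data, returns a second form."""
    name: str
    transform: Callable[[Any], Any]

    def run(self, data: Any) -> Any:
        return self.transform(data)


def run_modules(modules: List[Module], data: Any) -> Any:
    """Pass the output of each module to the next module in the list."""
    for module in modules:
        data = module.run(data)
    return data


def count_by_portion(portion_indices):
    """Counting step: number of mapped reads per portion index."""
    return dict(Counter(portion_indices))


def normalize_counts(counts):
    """Normalization step (assumed here): divide each count by the total."""
    total = sum(counts.values())
    return {portion: n / total for portion, n in counts.items()}


counting_module = Module("counting", count_by_portion)
normalization_module = Module("normalization", normalize_counts)

# Toy input: the portion index to which each of six reads was mapped.
mapped_read_portions = [0, 0, 1, 2, 2, 2]
levels = run_modules([counting_module, normalization_module], mapped_read_portions)
```

In this sketch each module is independent of the others, so a different normalization (or an additional filtering or weighting module) could be substituted without changing the counting module.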

A computer program product sometimes is embodied on a tangible computer-readable medium, and sometimes is tangibly embodied on a non-transitory computer-readable medium. A module sometimes is stored on a computer readable medium (e.g., disk, drive) or in memory (e.g., random access memory). A module and processor capable of implementing instructions from a module can be located in an apparatus or in different apparatus. A module and/or processor capable of implementing an instruction for a module can be located in the same location as a user (e.g., local network) or in a different location from a user (e.g., remote network, cloud system). In embodiments in which a method is carried out in conjunction with two or more modules, the modules can be located in the same apparatus, one or more modules can be located in different apparatus in the same physical location, and one or more modules may be located in different apparatus in different physical locations.

An apparatus, in some embodiments, comprises at least one processor for carrying out the instructions in a module. Counts of sequence reads mapped to portions of a reference genome sometimes are accessed by a processor that executes instructions configured to carry out a method described herein. Counts that are accessed by a processor can be within memory of a system, and the counts can be accessed and placed into the memory of the system after they are obtained. In some embodiments, an apparatus includes a processor (e.g., one or more processors) which processor can perform and/or implement one or more instructions (e.g., processes, routines and/or subroutines) from a module. In some embodiments, an apparatus includes multiple processors, such as processors coordinated and working in parallel. In some embodiments, an apparatus operates with one or more external processors (e.g., an internal or external network, server, storage device and/or storage network (e.g., a cloud)). In some embodiments, an apparatus comprises a module. In certain embodiments an apparatus comprises one or more modules. An apparatus comprising a module often can receive and transfer one or more of data and/or information to and from other modules. In certain embodiments, an apparatus comprises peripherals and/or components. In certain embodiments an apparatus can comprise one or more peripherals or components that can transfer data and/or information to and from other modules, peripherals and/or components. In certain embodiments an apparatus interacts with a peripheral and/or component that provides data and/or information. In certain embodiments peripherals and components assist an apparatus in carrying out a function or interact directly with a module.
Non-limiting examples of peripherals and/or components include a suitable computer peripheral, I/O or storage method or device including but not limited to scanners, printers, displays (e.g., monitors, LED, LCD or CRTs), cameras, microphones, pads (e.g., iPads, tablets), touch screens, smart phones, mobile phones, USB I/O devices, USB mass storage devices, keyboards, a computer mouse, digital pens, modems, hard drives, jump drives, flash drives, a processor, a server, CDs, DVDs, graphic cards, specialized I/O devices (e.g., sequencers, photo cells, photo multiplier tubes, optical readers, sensors, etc.), one or more flow cells, fluid handling components, network interface controllers, ROM, RAM, wireless transfer methods and devices (Bluetooth, WiFi, and the like), the world wide web (www), the internet, a computer and/or another module.

Software often is provided on a program product containing program instructions recorded on a computer readable medium, including, but not limited to, magnetic media including floppy disks, hard disks, and magnetic tape; and optical media including CD-ROM discs, DVD discs, magneto-optical discs, flash drives, RAM, floppy discs, the like, and other such media on which the program instructions can be recorded. In online implementation, a server and web site maintained by an organization can be configured to provide software downloads to remote users, or remote users may access a remote system maintained by an organization to remotely access software. Software may obtain or receive input information. Software may include a module that specifically obtains or receives data (e.g., a data receiving module that receives sequence read data and/or mapped read data) and may include a module that specifically processes the data (e.g., a processing module that processes received data (e.g., filters, normalizes, provides an outcome and/or report)). The terms “obtaining” and “receiving” input information refer to receiving data (e.g., sequence reads, mapped reads) by computer communication means from a local or remote site, human data entry, or any other method of receiving data. The input information may be generated in the same location at which it is received, or it may be generated in a different location and transmitted to the receiving location. In some embodiments, input information is modified before it is processed (e.g., placed into a format amenable to processing (e.g., tabulated)).

In some embodiments, provided are computer program products, such as, for example, a computer program product comprising a computer usable medium having a computer readable program code embodied therein, the computer readable program code adapted to be executed to implement a method comprising: (a) obtaining sequence reads of sample nucleic acid from a test subject; (b) partitioning a reference genome, or part thereof; (c) mapping the sequence reads obtained in (a) to a reference genome, which reference genome has been divided into portions according to the partitioning in (b); (d) counting the mapped sequence reads within the portions; (e) generating a sample normalized count profile by normalizing the counts for the portions obtained in (d); and (f) determining the presence or absence of a genetic variation from the sample normalized count profile in (e).
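
Steps (b) through (f) above can be sketched, purely for illustration, in Python. The sketch makes several simplifying assumptions that are not drawn from the embodiments described herein: portions are fixed-size windows, reads are already aligned and represented as (chromosome, position) pairs, normalization is division by the median portion count, and a variation is flagged when a normalized level deviates from 1.0 by more than a hypothetical tolerance of 0.25. All function names, the chromosome names and the toy read data are hypothetical.

```python
from collections import Counter
from statistics import median


def partition(chrom_lengths, portion_size):
    """(b) Split each chromosome into fixed-size portions (assumed scheme)."""
    portions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, portion_size):
            portions.append((chrom, start, min(start + portion_size, length)))
    return portions


def count_reads(mapped_positions, portion_size):
    """(c)-(d) Assign each aligned read to its portion and count reads."""
    return Counter(
        (chrom, (pos // portion_size) * portion_size)
        for chrom, pos in mapped_positions
    )


def normalize(counts, portions):
    """(e) Divide each portion count by the median count (assumed non-zero)."""
    raw = [counts.get((chrom, start), 0) for chrom, start, _ in portions]
    med = median(raw)
    return [c / med for c in raw]


def call_variation(profile, portions, tolerance=0.25):
    """(f) Flag portions whose normalized level deviates from 1.0."""
    return [p for p, level in zip(portions, profile) if abs(level - 1.0) > tolerance]


# Toy data: 10 reads per 100-base portion on chrA, 15 per portion on chrB.
chrom_lengths = {"chrA": 400, "chrB": 200}
portion_size = 100
portions = partition(chrom_lengths, portion_size)
reads = [("chrA", start + i) for start in range(0, 400, 100) for i in range(10)]
reads += [("chrB", start + i) for start in range(0, 200, 100) for i in range(15)]

counts = count_reads(reads, portion_size)
profile = normalize(counts, portions)
flagged = call_variation(profile, portions)
```

With this toy input the two chrB portions have a normalized level of 1.5 and are the only portions flagged. A practical implementation would use a real aligner for step (c) and a statistically justified decision rule for step (f).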

Software can include one or more algorithms in certain embodiments. An algorithm may be used for processing data and/or providing an outcome or report according to a finite sequence of instructions. An algorithm often is a list of defined instructions for completing a task. Starting from an initial state, the instructions may describe a computation that proceeds through a defined series of successive states, eventually terminating in a final ending state. The transition from one state to the next is not necessarily deterministic (e.g., some algorithms incorporate randomness). By way of example, and without limitation, an algorithm can be a search algorithm, sorting algorithm, merge algorithm, numerical algorithm, graph algorithm, string algorithm, modeling algorithm, computational geometric algorithm, combinatorial algorithm, machine learning algorithm, cryptography algorithm, data compression algorithm, parsing algorithm and the like. An algorithm can include one algorithm or two or more algorithms working in combination. An algorithm can be of any suitable complexity class and/or parameterized complexity. An algorithm can be used for calculation and/or data processing, and in some embodiments, can be used in a deterministic or probabilistic/predictive approach. An algorithm can be implemented in a computing environment by use of a suitable programming language, non-limiting examples of which are C, C++, Java, Perl, Python, Fortran, and the like. In some embodiments, an algorithm can be configured or modified to include margins of error, statistical analysis, statistical significance, and/or comparison to other information or data sets (e.g., applicable when using a neural net or clustering algorithm).

In certain embodiments, several algorithms may be implemented for use in software. These algorithms can be trained with raw data in some embodiments. For each new raw data sample, the trained algorithms may produce a representative processed data set or outcome. A processed data set sometimes is of reduced complexity compared to the parent data set that was processed. Based on a processed set, the performance of a trained algorithm may be assessed based on sensitivity and specificity, in some embodiments. An algorithm with the highest sensitivity and/or specificity may be identified and utilized, in certain embodiments.
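
As a minimal sketch of such an assessment, the following Python example computes sensitivity and specificity for two hypothetical trained algorithms against known classifications and selects one of them. Ranking by the sum of sensitivity and specificity is an assumption made for this illustration; the truth labels, the algorithm names and their outputs are invented toy values.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for paired boolean labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    return tp / (tp + fn), tn / (tn + fp)


# Toy ground truth: True marks a sample known to carry a genetic variation.
truth = [True, True, True, False, False, False, False, False]

# Hypothetical outputs of two trained algorithms on the same samples.
predictions = {
    "classifier_a": [True, True, False, False, False, False, False, True],
    "classifier_b": [True, True, True, False, False, False, False, True],
}

results = {
    name: sensitivity_specificity(truth, predicted)
    for name, predicted in predictions.items()
}

# Assumed selection rule: highest sensitivity + specificity.
best = max(results, key=lambda name: sum(results[name]))
```

Here `classifier_b` is selected because it detects all three affected samples while producing the same single false positive as `classifier_a`. Other selection rules (for example, requiring a minimum specificity) could be substituted.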

In certain embodiments, simulated (or simulation) data can aid data processing, for example, by training an algorithm or testing an algorithm. In some embodiments, simulated data includes various hypothetical samplings of different groupings of sequence reads. Simulated data may be based on what might be expected from a real population or may be skewed to test an algorithm and/or to assign a correct classification. Simulated data also is referred to herein as “virtual” data. Simulations can be performed by a computer program in certain embodiments. One possible step in using a simulated data set is to evaluate the confidence of an identified result, e.g., how well a random sampling matches or best represents the original data. One approach is to calculate a probability value (p-value), which estimates the probability of a random sample having a better score than the selected samples. In some embodiments, an empirical model may be assessed, in which it is assumed that at least one sample matches a reference sample (with or without resolved variations). In some embodiments, another distribution, such as a Poisson distribution for example, can be used to define the probability distribution.
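
One way to estimate such a p-value from simulated data is sketched below in Python. The sketch assumes that a higher score is a better score, draws simulated scores from a standard normal distribution as a stand-in for whatever null model is appropriate, and applies an add-one correction so that the estimate is never exactly zero. The function names and parameter values are hypothetical.

```python
import random


def empirical_p_value(observed_score, simulate_score, n_simulations=10000, seed=0):
    """Estimate P(simulated score > observed score) from random samples.

    simulate_score is a callable that takes a random.Random instance and
    returns one score drawn under the null model (an assumption of this sketch).
    """
    rng = random.Random(seed)
    better = sum(
        1 for _ in range(n_simulations) if simulate_score(rng) > observed_score
    )
    # Add-one correction: counts the observed sample among the simulations.
    return (better + 1) / (n_simulations + 1)


def standard_normal_score(rng):
    """Stand-in null model: scores follow a standard normal distribution."""
    return rng.gauss(0.0, 1.0)


p_extreme = empirical_p_value(3.0, standard_normal_score)
p_typical = empirical_p_value(0.0, standard_normal_score)
```

Under these assumptions an observed score of 3.0 yields a small p-value, whereas an observed score of 0.0 yields a p-value near 0.5. A different null model can be supplied by passing a different `simulate_score` callable.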

A system may include one or more processors in certain embodiments. A processor can be connected to a communication bus. A computer system may include a main memory, often random access memory (RAM), and can also include a secondary memory. Memory in some embodiments comprises a non-transitory computer-readable storage medium. Secondary memory can include, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, an optical disk drive, memory card and the like. A removable storage drive often reads from and/or writes to a removable storage unit. Non-limiting examples of removable storage units include a floppy disk, magnetic tape, optical disk, and the like, which can be read by and written to by, for example, a removable storage drive. A removable storage unit can include a computer-usable storage medium having stored therein computer software and/or data.

A processor may implement software in a system. In some embodiments, a processor may be programmed to automatically perform a task described herein that a user could perform. Accordingly, a processor, or algorithm conducted by such a processor, can require little to no supervision or input from a user (e.g., software may be programmed to implement a function automatically). In some embodiments, the complexity of a process is so large that a single person or group of persons could not perform the process in a timeframe short enough for determining the presence or absence of a genetic variation.

In some embodiments, secondary memory may include other similar means for allowing computer programs or other instructions to be loaded into a computer system. For example, a system can include a removable storage unit and an interface device. Non-limiting examples of such systems include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, and other removable storage units and interfaces that allow software and data to be transferred from the removable storage unit to a computer system.

One entity can generate counts of sequence reads, partition a reference genome, or part thereof, map the sequence reads to portions, count the mapped reads, and utilize the counted mapped reads in a method, system, apparatus or computer program product described herein, in some embodiments. Counts of sequence reads mapped to portions sometimes are transferred by one entity to a second entity for use by the second entity in a method, system, apparatus or computer program product described herein, in certain embodiments.

In some embodiments, one entity generates sequence reads and a second entity maps those sequence reads to portions in a reference genome. The second entity sometimes counts the mapped reads and utilizes the counted mapped reads in a method, system, apparatus or computer program product described herein. In certain embodiments the second entity transfers the mapped reads to a third entity, and the third entity counts the mapped reads and utilizes the mapped reads in a method, system, apparatus or computer program product described herein. In certain embodiments the second entity counts the mapped reads and transfers the counted mapped reads to a third entity, and the third entity utilizes the counted mapped reads in a method, system, apparatus or computer program product described herein. In embodiments involving a third entity, the third entity sometimes is the same as the first entity. That is, the first entity sometimes transfers sequence reads to a second entity, which second entity can map sequence reads to portions in a reference genome (e.g., a partitioned reference genome, or part thereof) and/or count the mapped reads, and the second entity can transfer the mapped and/or counted reads to a third entity. A third entity sometimes can utilize the mapped and/or counted reads in a method, system, apparatus or computer program product described herein, where the third entity sometimes is the same as the first entity, and sometimes the third entity is different from the first or second entity. In some embodiments a fourth entity generates the partitioned reference genome, or part thereof.

In some embodiments, one entity obtains blood from a pregnant female, optionally isolates nucleic acid from the blood (e.g., from the plasma or serum), and transfers the blood or nucleic acid to a second entity that generates sequence reads from the nucleic acid.

FIG. 12 illustrates a non-limiting example of a computing environment 510 in which various systems, methods, algorithms, and data structures described herein may be implemented. The computing environment 510 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the systems, methods, and data structures described herein. Neither should computing environment 510 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in computing environment 510. A subset of systems, methods, and data structures shown in FIG. 12 can be utilized in certain embodiments. Systems, methods, and data structures described herein are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The operating environment 510 of FIG. 12 includes a general purpose computing device in the form of a computer 520, including a processing unit 521, a system memory 522, and a system bus 523 that operatively couples various system components including the system memory 522 to the processing unit 521. There may be only one or there may be more than one processing unit 521, such that the processor of computer 520 includes a single central-processing unit (CPU), or a plurality of processing units, commonly referred to as a parallel processing environment. The computer 520 may be a conventional computer, a distributed computer, or any other type of computer.

The system bus 523 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory may also be referred to as simply the memory, and includes read only memory (ROM) 524 and random access memory (RAM). A basic input/output system (BIOS) 526, containing the basic routines that help to transfer information between elements within the computer 520, such as during start-up, is stored in ROM 524. The computer 520 may further include a hard disk drive 527 for reading from and writing to a hard disk, not shown, a magnetic disk drive 528 for reading from or writing to a removable magnetic disk 529, and an optical disk drive 530 for reading from or writing to a removable optical disk 531 such as a CD ROM or other optical media.

The hard disk drive 527, magnetic disk drive 528, and optical disk drive 530 are connected to the system bus 523 by a hard disk drive interface 532, a magnetic disk drive interface 533, and an optical disk drive interface 534, respectively. The drives and their associated computer-readable media provide nonvolatile storage of computer-readable instructions, data structures, program modules and other data for the computer 520. Any type of computer-readable media that can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memories (ROMs), and the like, may be used in the operating environment.

A number of program modules may be stored on the hard disk, magnetic disk 529, optical disk 531, ROM 524, or RAM, including an operating system 535, one or more application programs 536, other program modules 537, and program data 538. A user may enter commands and information into the personal computer 520 through input devices such as a keyboard 540 and pointing device 542. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 521 through a serial port interface 546 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 547 or other type of display device is also connected to the system bus 523 via an interface, such as a video adapter 548. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.

The computer 520 may operate in a networked environment using logicalconnections to one or more remote computers, such as remote computer549. These logical connections may be achieved by a communication devicecoupled to or a part of the computer 520, or in other manners. Theremote computer 549 may be another computer, a server, a router, anetwork PC, a client, a peer device or other common network node, andtypically includes many or all of the elements described above relativeto the computer 520, although only a memory storage device 550 has beenillustrated in FIG. 12. The logical connections depicted in FIG. 12include a local-area network (LAN) 551 and a wide-area network (WAN)552. Such networking environments are commonplace in office networks,enterprise-wide computer networks, intranets and the Internet, which allare types of networks.

When used in a LAN-networking environment, the computer 520 is connectedto the local network 551 through a network interface or adapter 553,which is one type of communications device. When used in aWAN-networking environment, the computer 520 often includes a modem 554,a type of communications device, or any other type of communicationsdevice for establishing communications over the wide area network 552.The modem 554, which may be internal or external, is connected to thesystem bus 523 via the serial port interface 546. In a networkedenvironment, program modules depicted relative to the personal computer520, or portions thereof, may be stored in the remote memory storagedevice. It is appreciated that the network connections shown arenon-limiting examples and other communications devices for establishinga communications link between computers may be used.

Modules

One or more modules can be utilized in a method described herein, non-limiting examples of which include a logic processing module, sequencing module, partitioning module, mapping module, counting module, filtering module, weighting module, normalization module, GC bias module, level module, comparison module, range setting module, categorization module, plotting module, representation module, relationship module, outcome module and/or data display organization module, the like or combination thereof. Modules are sometimes controlled by a microprocessor. In certain embodiments a module, or an apparatus comprising one or more modules, gathers, assembles, receives, obtains, accesses, recovers, provides and/or transfers data and/or information to or from another module, apparatus, component, peripheral or operator of an apparatus. In some embodiments, data and/or information (e.g., sequencing reads) are provided to a module by an apparatus comprising one or more of the following: one or more flow cells, a camera, a detector (e.g., a photo detector, a photo cell, an electrical detector (e.g., an amplitude modulation detector, a frequency and phase modulation detector, a phase-locked loop detector)), a counter, a sensor (e.g., a sensor of pressure, temperature, volume, flow, weight), a fluid handling device, a printer, a display (e.g., an LED, LCD or CRT), the like or combinations thereof. For example, sometimes an operator of an apparatus provides a constant, a threshold value, a formula or a predetermined value to a module. A module is often configured to transfer data and/or information to or from another module or apparatus. A module can receive data and/or information from another module, non-limiting examples of which include a logic processing module, sequencing module, partitioning module, mapping module, counting module, filtering module, weighting module, normalization module, GC bias module, level module, comparison module, range setting module, categorization module, plotting module, representation module, relationship module, outcome module and/or data display organization module, the like or combination thereof. A module can manipulate and/or transform data and/or information. Data and/or information derived from or transformed by a module can be transferred to another suitable apparatus and/or module, non-limiting examples of which include a logic processing module, sequencing module, partitioning module, mapping module, counting module, filtering module, weighting module, normalization module, GC bias module, level module, comparison module, range setting module, categorization module, plotting module, representation module, relationship module, outcome module and/or data display organization module, the like or combination thereof. An apparatus comprising a module can comprise at least one processor. In some embodiments, data and/or information are received by and/or provided by an apparatus comprising a module. An apparatus comprising a module can include a processor (e.g., one or more processors) which processor can perform and/or implement one or more instructions (e.g., processes, routines and/or subroutines) of a module. In some embodiments, a module operates with one or more external processors (e.g., an internal or external network, server, storage device and/or storage network (e.g., a cloud)).

Logic Processing Module

In certain embodiments a logic processing module orchestrates, controls, limits, organizes, orders, distributes, partitions, transforms and/or regulates data and/or information or the transfer of data and/or information to and from one or more other modules, peripherals or devices.

Data Display Organization Module

In certain embodiments a data display organization module processes and/or transforms data and/or information into a suitable visual medium, non-limiting examples of which include images, video and/or text (e.g., numbers, letters and symbols). In some embodiments, a data display organization module processes, transforms and/or transfers data and/or information for presentation on a suitable display (e.g., a monitor, LED, LCD, CRT, the like or combinations thereof), a printer, a suitable peripheral or device. In some embodiments, a data display organization module processes and/or transforms data and/or information into a visual representation of a fetal or maternal genome, chromosome or part thereof.

Sequencing Module

In some embodiments, a sequencing module obtains, generates, gathers, assembles, manipulates, transforms, processes and/or transfers sequence reads. A “sequence receiving module” as used herein is the same as a “sequencing module”. An apparatus comprising a sequencing module can be any apparatus that determines the sequence of a nucleic acid utilizing a sequencing technology known in the art. In some embodiments a sequencing module can align, assemble, fragment, complement, reverse complement, error check, or error correct sequence reads.

Partitioning Module

A reference genome or part thereof (e.g., a chromosome, or part thereof, in a reference genome) can be partitioned by a partitioning module. A partitioning module can partition a reference genome, or part thereof, by a method described herein. In some embodiments, a partitioning module or an apparatus comprising a partitioning module is required to provide a partitioned reference genome, or part thereof.

Mapping Module

Sequence reads can be mapped by a mapping module or by an apparatus comprising a mapping module, which mapping module generally maps reads to a reference genome or segment thereof. A mapping module can map sequencing reads by a suitable method known in the art. In some embodiments, a mapping module or an apparatus comprising a mapping module is required to provide mapped sequence reads.

Counting Module

Counts can be provided by a counting module or by an apparatus comprising a counting module. In some embodiments a counting module counts sequence reads mapped to a reference genome. In some embodiments a counting module generates, assembles, and/or provides counts according to a counting method known in the art. In some embodiments, a counting module or an apparatus comprising a counting module is required to provide counts.

Filtering Module

Filtering portions (e.g., portions of a reference genome) can be provided by a filtering module (e.g., by an apparatus comprising a filtering module). In some embodiments, a filtering module is required to provide filtered portion data (e.g., filtered portions) and/or to remove portions from consideration. In certain embodiments a filtering module removes counts mapped to a portion from consideration. In certain embodiments a filtering module removes counts mapped to a portion from a determination of a level or a profile. A filtering module can filter data (e.g., counts, counts mapped to portions, portions, portion levels, normalized counts, raw counts, and the like) by one or more filtering methods known in the art or described herein.

Weighting Module

Weighting portions (e.g., portions of a reference genome) can be provided by a weighting module (e.g., by an apparatus comprising a weighting module). In some embodiments, a weighting module is required to weight genomic sections and/or provide weighted portion values. A weighting module can weight portions by one or more weighting methods known in the art or described herein.

Normalization Module

Normalized data (e.g., normalized counts) can be provided by a normalization module (e.g., by an apparatus comprising a normalization module). In some embodiments, a normalization module is required to provide normalized data (e.g., normalized counts) obtained from sequencing reads. A normalization module can normalize data (e.g., counts, filtered counts, raw counts) by one or more normalization methods described herein (e.g., ChAI, hybrid normalization, the like or combinations thereof) or known in the art.

GC Bias Module

Determining GC bias (e.g., determining GC bias for each of the portions of a reference genome) can be provided by a GC bias module (e.g., by an apparatus comprising a GC bias module). In some embodiments, a GC bias module is required to provide a determination of GC bias. In some embodiments a GC bias module provides a determination of GC bias from a fitted relationship (e.g., a fitted linear relationship) between counts of sequence reads mapped to each of the portions of a reference genome and GC content of each portion. A GC bias module sometimes is part of a normalization module (e.g., ChAI normalization module).

Level Module

Determining levels and/or calculating genomic section levels for portions of a reference genome can be provided by a level module (e.g., by an apparatus comprising a level module). In some embodiments, a level module is required to provide a level or a calculated genomic section level (e.g., according to Equation I or II). In some embodiments a level module provides a level from a fitted relationship (e.g., a fitted linear relationship) between a GC bias and counts of sequence reads mapped to each of the portions of a reference genome. In some embodiments, a level module provides a genomic section level (i.e., L_(i)) according to the equation L_(i)=(m_(i)−G_(i)S)I⁻¹, where G_(i) is the GC bias, m_(i) is measured counts mapped to each portion of a reference genome, i is a sample, I is the intercept and S is the slope of a fitted relationship (e.g., a fitted linear relationship) between a GC bias and counts of sequence reads mapped to each of the portions of a reference genome.
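By way of a non-limiting illustration, the level equation L_(i)=(m_(i)−G_(i)S)I⁻¹ can be sketched in Python as follows. The function name and the choice to pass one slope and one intercept per portion are assumptions made for this sketch; they are not taken from the description above.

```python
def genomic_section_levels(measured_counts, gc_bias, slopes, intercepts):
    """Return L = (m - G * S) / I for every portion of one sample.

    measured_counts: counts m mapped to each portion for the sample.
    gc_bias:         GC bias coefficient G of the sample (a single number).
    slopes:          slope S of the fitted relationship, one per portion (assumed layout).
    intercepts:      intercept I of the fitted relationship, one per portion (assumed layout).
    """
    if not (len(measured_counts) == len(slopes) == len(intercepts)):
        raise ValueError("one slope and one intercept are needed per portion")
    levels = []
    for m, s, intercept in zip(measured_counts, slopes, intercepts):
        if intercept == 0:
            raise ValueError("intercept must be non-zero")
        # Remove the GC-dependent part of the count, then scale by the intercept.
        levels.append((m - gc_bias * s) / intercept)
    return levels


if __name__ == "__main__":
    # Two portions whose raw counts differ only because of GC bias.
    print(genomic_section_levels([110.0, 95.0], 0.5, [20.0, -10.0], [100.0, 100.0]))
```

In this sketch a level near 1 indicates a portion whose count matches the fitted expectation once the GC-dependent term has been removed.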

Comparison Module

A first level can be identified as significantly different from a second level by a comparison module or by an apparatus comprising a comparison module. In some embodiments, a comparison module or an apparatus comprising a comparison module is required to provide a comparison between two levels.

Range Setting Module

Expected ranges (e.g., expected level ranges) for various copy number variations (e.g., duplications, insertions and/or deletions) or ranges for the absence of a copy number variation can be provided by a range setting module or by an apparatus comprising a range setting module. In certain embodiments, expected levels are provided by a range setting module or by an apparatus comprising a range setting module. In some embodiments, a range setting module or an apparatus comprising a range setting module is required to provide expected levels and/or ranges.

Categorization Module

A copy number variation (e.g., a maternal and/or fetal copy number variation, a fetal copy number variation, a duplication, insertion, deletion) can be categorized by a categorization module or by an apparatus comprising a categorization module. In certain embodiments a copy number variation (e.g., a maternal and/or fetal copy number variation) is categorized by a categorization module. In certain embodiments a level (e.g., a first level) determined to be significantly different from another level (e.g., a second level) is identified as representative of a copy number variation by a categorization module. In certain embodiments the absence of a copy number variation is determined by a categorization module. In some embodiments, a determination of a copy number variation can be made by an apparatus comprising a categorization module. A categorization module can be specialized for categorizing a maternal and/or fetal copy number variation, a fetal copy number variation, a duplication, deletion or insertion or lack thereof or combination of the foregoing. For example, a categorization module that identifies a maternal deletion can be different than and/or distinct from a categorization module that identifies a fetal duplication. In some embodiments, a categorization module or an apparatus comprising a categorization module is required to identify a copy number variation or an outcome determinative of a copy number variation.

Plotting Module

In some embodiments a plotting module processes and/or transforms data and/or information into a suitable visual medium, non-limiting examples of which include a chart, plot, graph, the like or combinations thereof. In some embodiments, a plotting module processes, transforms and/or transfers data and/or information for presentation on a suitable display (e.g., a monitor, LED, LCD, CRT, the like or combinations thereof), a printer, a suitable peripheral or device. In certain embodiments a plotting module provides a visual display of a count, a level, and/or a profile. In some embodiments, a data display organization module processes and/or transforms data and/or information into a visual representation of a fetal or maternal genome, chromosome or part thereof. In some embodiments, a plotting module or an apparatus comprising a plotting module is required to plot a count, a level or a profile.

Relationship Module

In certain embodiments, a relationship module processes and/or transforms data and/or information into a relationship. In certain embodiments, a relationship is generated by and/or transferred from a relationship module.

Outcome Module

The presence or absence of a genetic variation (an aneuploidy, a fetal aneuploidy, a copy number variation, a microdeletion, a microduplication) is, in some embodiments, identified by an outcome module or by an apparatus comprising an outcome module. In certain embodiments a genetic variation is identified by an outcome module. Often a determination of the presence or absence of an aneuploidy, a microdeletion, and/or a microduplication is identified by an outcome module. In some embodiments, an outcome determinative of a genetic variation (an aneuploidy, a copy number variation, a microdeletion, a microduplication) can be identified by an outcome module or by an apparatus comprising an outcome module. An outcome module can be specialized for determining a specific genetic variation (e.g., a trisomy, a trisomy 21, a trisomy 18, certain microdeletions or microduplications). For example, an outcome module that identifies a trisomy 21 can be different than and/or distinct from an outcome module that identifies a trisomy 18. In some embodiments, an outcome module or an apparatus comprising an outcome module is required to identify a genetic variation or an outcome determinative of a genetic variation (e.g., an aneuploidy, a copy number variation, a microdeletion, a microduplication). In some embodiments, an outcome module can identify whether a genetic variation (e.g., microduplication or microdeletion) is a maternal genetic variation or a fetal genetic variation. A genetic variation or an outcome determinative of a genetic variation identified by methods described herein can be independently verified by further testing (e.g., by targeted sequencing of maternal and/or fetal nucleic acid).

Transformations

As noted above, data sometimes is transformed from one form into another form. The terms “transformed”, “transformation”, and grammatical derivations or equivalents thereof, as used herein refer to an alteration of data from a physical starting material (e.g., test subject and/or reference subject sample nucleic acid) into a digital representation of the physical starting material (e.g., sequence read data), and in some embodiments includes a further transformation into one or more numerical values or graphical representations of the digital representation that can be utilized to provide an outcome. In certain embodiments, the one or more numerical values and/or graphical representations of digitally represented data can be utilized to represent the appearance of a test subject's physical genome (e.g., virtually represent or visually represent the presence or absence of a genomic insertion, duplication or deletion; represent the presence or absence of a variation in the physical amount of a sequence associated with medical conditions). A virtual representation sometimes is further transformed into one or more numerical values or graphical representations of the digital representation of the starting material. These methods can transform physical starting material into a numerical value or graphical representation, or a representation of the physical appearance of a test subject's genome.

In some embodiments, transformation of a data set facilitates providing an outcome by reducing data complexity and/or data dimensionality. Data set complexity sometimes is reduced during the process of transforming a physical starting material into a virtual representation of the starting material (e.g., sequence reads representative of physical starting material). A suitable feature or variable can be utilized to reduce data set complexity and/or dimensionality. Non-limiting examples of features that can be chosen for use as a target feature for data processing include GC content, fetal gender prediction, identification of chromosomal aneuploidy, identification of particular genes or proteins, identification of cancer, diseases, inherited genes/traits, chromosomal abnormalities, a biological category, a chemical category, a biochemical category, a category of genes or proteins, a gene ontology, a protein ontology, co-regulated genes, cell signaling genes, cell cycle genes, proteins pertaining to the foregoing genes, gene variants, protein variants, co-regulated proteins, amino acid sequence, nucleotide sequence, protein structure data and the like, and combinations of the foregoing. Non-limiting examples of data set complexity and/or dimensionality reduction include: reduction of a plurality of sequence reads to profile plots; reduction of a plurality of sequence reads to numerical values (e.g., normalized values, Z-scores, p-values); reduction of multiple analysis methods to probability plots or single points; principal component analysis of derived quantities; and the like or combinations thereof.

Certain methods described herein may be performed in conjunction with methods described, for example, in International Patent Application Publication No. WO2013/052907, International Patent Application Publication No. WO2013/055817, International Patent Application Publication No. WO2013/109981, International Patent Application Publication No. WO2013/177086, International Patent Application Publication No. WO2013/192562, International Patent Application Publication No. WO2014/116598, International Patent Application Publication No. WO2014/055774, International Patent Application Publication No. WO2014/190286, International Patent Application Publication No. WO2014/205401, and International Patent Application Publication No. WO2015/051163, each of which is incorporated by reference herein in its entirety.

EXAMPLES

The examples set forth below illustrate certain embodiments and do not limit the technology.

Example 1: Limit of Detection for Arbitrary Size Microdeletion/Microduplication

For extended content research (e.g., determining the presence or absence of a microdeletion or microduplication), one aspect is to understand the limit of detection (LoD) for an arbitrary size event. Four factors may affect the LoD of a given event: fetal fraction, event size, sequencing depth, and the genomic location of the event. Regarding the first three factors, an event generally can be more readily detected with higher fetal fraction, larger size and deeper sequencing. The fourth factor generally relates to the nature of sequencing: certain regions can be more unstable than others due to various reasons (e.g., guanine and cytosine (GC) bias, repetitive elements, mappability, and the like), and therefore events in certain regions can be more difficult to detect. In this example, a theoretical framework was developed to incorporate the aforementioned factors and derive an analytic formula to calculate the LoD.

Throughout the study, the event location and size were known (i.e., targeted detection). Also, samples were selected for which the event of interest was trisomic for the child and euploid for the mother, although the theory can be extended to other maternal-fetal ploidy combinations.

Independent Bin Count

Assume the genome has been partitioned into $N_{total}$ equally spaced non-overlapping bins. For an unaffected euploid sample, let $x_i$ represent the bin count for bin $i \in \{1, \ldots, N_{total}\}$. Due to the inherent randomness in sequencing, $x_i$ is a random variable with $x_i \sim f(\mu_{x_i}, \sigma_{x_i})$, where $\mu_{x_i}$ and $\sigma_{x_i}$ are the mean and standard deviation for bin i and f(⋅) is some distribution function. Similarly, for an affected trisomy sample, the bin count for the trisomy region is $y_i \sim f(\mu_{y_i}, \sigma_{y_i})$, where $\mu_{y_i} = \mu_{x_i}(1 + f/2)$ and the scalar f (as distinct from the distribution function f(⋅)) is the fetal fraction. Note that the bin count variability is location specific, which accounts for the fourth factor described above.

For an event with size N spanning bin p to q, define the summed bin counts $X \triangleq \sum_{i=p}^{q} x_i$ and $Y \triangleq \sum_{i=p}^{q} y_i$. Assuming bin counts are independent,

$E[X] = \sum_{i=p}^{q}\mu_{x_i} \triangleq N\mu_x, \quad sd[X] = \sqrt{\sum_{i=p}^{q}\sigma_{x_i}^{2}} \triangleq \sqrt{N}\sigma_x, \quad E[Y] = \sum_{i=p}^{q}\mu_{y_i} \triangleq N\mu_y, \quad sd[Y] = \sqrt{\sum_{i=p}^{q}\sigma_{y_i}^{2}} \triangleq \sqrt{N}\sigma_y,$

where $\mu_x$, $\sigma_x$, $\mu_y$ and $\sigma_y$ represent the average bin count mean and standard deviation for the event region.

The z-score of the event can be calculated as

$\begin{matrix}{Z = \frac{Y - {E\lbrack X\rbrack}}{s{d\lbrack X\rbrack}}} & (1)\end{matrix}$

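It follows, by substituting $E[Y] = N\mu_x(1 + f/2)$, $E[X] = N\mu_x$ and $sd[X] = \sqrt{N}\sigma_x$ into Eq. (1), that

$E[Z] = \frac{E[Y] - E[X]}{sd[X]} = \sqrt{N}\frac{\mu_x}{\sigma_x}\frac{f}{2}, \quad sd[Z] = \frac{sd[Y]}{sd[X]} = \frac{\sigma_y}{\sigma_x} \approx 1.$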

Furthermore, by the central limit theorem, Z can be approximated by a normal distribution if the number of bins N is reasonably large. In sum, we have derived the distribution of z-scores for an arbitrary size event

$\begin{matrix}{Z \sim {{Normal}\mspace{14mu}\left( {{\sqrt{N}\frac{\mu_{x}}{\sigma_{x}}\frac{f}{2}},\ 1} \right)}} & (2)\end{matrix}$

where N is the number of bins for the event, μ_(x) and σ_(x) are the average bin count mean and standard deviation for the event region in euploid samples, and f is the fetal fraction.
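As a hedged illustration of Eq. (2), the short Python sketch below computes the expected z-score of an event and the probability that the z-score exceeds a cutoff c. The function names and the default cutoff of 3 are assumptions chosen for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def expected_z(n_bins, mean_count, sd_count, fetal_fraction):
    """Mean of Z under Eq. (2): sqrt(N) * (mu_x / sigma_x) * (f / 2)."""
    return sqrt(n_bins) * (mean_count / sd_count) * (fetal_fraction / 2.0)


def detection_probability(n_bins, mean_count, sd_count, fetal_fraction, cutoff=3.0):
    """P(Z > cutoff) when Z ~ Normal(expected_z, 1)."""
    mu_z = expected_z(n_bins, mean_count, sd_count, fetal_fraction)
    return 1.0 - NormalDist(mu=mu_z, sigma=1.0).cdf(cutoff)


if __name__ == "__main__":
    # Hypothetical event: 100 bins, mean bin count 250, bin standard deviation 25, f = 10%.
    print(expected_z(100, 250.0, 25.0, 0.10))
    print(detection_probability(100, 250.0, 25.0, 0.10))
```

The false negative rate at a given fetal fraction is one minus the detection probability returned by this sketch.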

At a given false negative rate α, the minimum fetal fraction needed can be derived from Eq. (2)

$\begin{matrix}{f = \frac{2\left( {Z_{\alpha} + c} \right)\frac{\sigma_{x}}{\mu_{x}}}{\sqrt{N}}} & (3)\end{matrix}$

where c is the predefined z-score cutoff for euploids (e.g., c=3 for T21 detection), and Z_(α) is the critical value for the standard normal distribution with a tail probability of α.
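Eq. (3) can be evaluated directly. The following Python sketch is one possible rendering, in which the function name and default arguments are assumptions; the critical value Z_(α) is obtained from the standard normal quantile function.

```python
from math import sqrt
from statistics import NormalDist


def min_fetal_fraction(n_bins, mean_count, sd_count, fn_rate=0.01, cutoff=3.0):
    """Minimum fetal fraction from Eq. (3) for independent bins.

    f = 2 * (Z_alpha + c) * (sigma_x / mu_x) / sqrt(N)
    """
    if not 0.0 < fn_rate < 1.0:
        raise ValueError("fn_rate must lie strictly between 0 and 1")
    # Z_alpha is the standard normal value with upper tail probability alpha.
    z_alpha = NormalDist().inv_cdf(1.0 - fn_rate)
    return 2.0 * (z_alpha + cutoff) * (sd_count / mean_count) / sqrt(n_bins)


if __name__ == "__main__":
    # Hypothetical event regions of increasing size (number of bins).
    for n_bins in (25, 100, 400):
        print(n_bins, round(min_fetal_fraction(n_bins, 250.0, 25.0), 4))
```

Consistent with the factors discussed above, the required fetal fraction in this sketch falls as the event spans more bins or as the ratio σ_(x)/μ_(x) decreases.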

Dependent Bin Count

The method described above assumes that bins are independent; however, weak correlations sometimes are observed among the different bins for normalized data. In such case, $sd[X] \neq \sqrt{\sum_{i=p}^{q}\sigma_{x_i}^{2}}$. To calculate the LoD, redefine $E[X] \triangleq \mu_X$, $sd[X] \triangleq \sigma_X$, $E[Y] \triangleq \mu_Y = \mu_X(1 + f/2)$ and $sd[Y] \triangleq \sigma_Y$. Assuming $\sigma_Y \approx \sigma_X$, the z-score distribution can be rewritten as:

$\begin{matrix}{Z \sim {{Normal}\mspace{14mu}\left( {{\frac{\mu_{X}}{\sigma_{X}}\frac{f}{2}},1} \right)}} & (4)\end{matrix}$

and at a given false negative rate α, the minimum fetal fraction is

$\begin{matrix}{f = {2\left( {Z_{\alpha} + c} \right)\frac{\sigma_{X}}{\mu_{X}}}} & (5)\end{matrix}$

Note that μ_(X) and σ_(X) can be empirically evaluated from a large pool of euploid samples.
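A minimal sketch of this empirical evaluation is given below. It assumes that the summed (normalized) count of the event region has already been computed for each euploid sample in the pool; the function name and the use of the sample standard deviation are assumptions of the sketch.

```python
from statistics import NormalDist, mean, stdev


def min_fetal_fraction_empirical(region_sums, fn_rate=0.01, cutoff=3.0):
    """Minimum fetal fraction from Eq. (5), with mu_X and sigma_X estimated from data.

    region_sums: summed count X of the event region, one value per euploid sample.
    """
    if len(region_sums) < 2:
        raise ValueError("at least two euploid samples are needed")
    mu_x = mean(region_sums)       # empirical E[X]
    sigma_x = stdev(region_sums)   # empirical sd[X]; bin correlations are included implicitly
    z_alpha = NormalDist().inv_cdf(1.0 - fn_rate)
    return 2.0 * (z_alpha + cutoff) * sigma_x / mu_x


if __name__ == "__main__":
    # Hypothetical region sums from a small euploid pool.
    pool = [990.0, 1000.0, 1010.0, 1005.0, 995.0]
    print(min_fetal_fraction_empirical(pool))
```

Because σ_(X) is measured on the summed count, no independence assumption between bins is needed in this sketch.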

Population Level Limit of Detection Calculation

The population level false negative (FN) rate can be calculated by the following equation

$\begin{matrix}{FN_{total} = \int_{f=0}^{1} FN(f)\,p(f)\,df} & (6)\end{matrix}$

where FN(f) represents the FN rate at a given f (shaded area on FIG. 6), and p(f) represents the fetal fraction distribution. Therefore, to ensure a population level FN rate of α, one can solve the following equation for f₀

$\begin{matrix}{\alpha = \int_{f=f_{0}}^{1} FN(f)\,p(f)\,df} & (7)\end{matrix}$
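One way to solve Eq. (7) numerically is sketched below in Python, under stated assumptions: FN(f) is taken from Eq. (4) as the probability that Z falls below the cutoff c, the integral is approximated by the trapezoidal rule, and f₀ is found by bisection. The Beta(2, 18) fetal fraction density and all function names are hypothetical choices made for illustration; an empirically measured p(f) would be substituted in practice.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def fn_rate_at(f, snr, cutoff=3.0):
    """FN(f): P(Z < cutoff) with Z ~ Normal(snr * f / 2, 1), where snr = mu_X / sigma_X."""
    return STANDARD_NORMAL.cdf(cutoff - snr * f / 2.0)


def population_fn(f0, density, snr, cutoff=3.0, steps=2000):
    """Trapezoidal estimate of the integral of FN(f) * p(f) over [f0, 1]."""
    width = (1.0 - f0) / steps
    total = 0.0
    for k in range(steps + 1):
        f = f0 + k * width
        weight = 0.5 if k in (0, steps) else 1.0  # end points get half weight
        total += weight * fn_rate_at(f, snr, cutoff) * density(f)
    return total * width


def solve_f0(alpha, density, snr, cutoff=3.0, tol=1e-6):
    """Bisection for f0 in Eq. (7); the integral shrinks as f0 grows."""
    lo, hi = 0.0, 1.0
    if population_fn(lo, density, snr, cutoff) <= alpha:
        return lo  # the target is met without any fetal fraction floor
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if population_fn(mid, density, snr, cutoff) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def example_density(f):
    """Hypothetical Beta(2, 18) fetal fraction density (mean 0.10); an assumption."""
    return 342.0 * f * (1.0 - f) ** 17


if __name__ == "__main__":
    print(solve_f0(0.01, example_density, snr=100.0))
```

Bisection is used here because the right-hand side of Eq. (7) decreases monotonically in f₀ whenever FN(f) and p(f) are non-negative.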

Results

The table in FIG. 8 shows a list of common microdeletion syndromes. Given this table, Equation (7) was applied to calculate the minimum fetal fractions to achieve a population level FN rate of 1% at different sequencing coverage (FIG. 7). In general, the minimum required fetal fraction is lower for larger syndromes and higher sequencing coverage.

Example 2: Optimal Discretization of Sequencing Data

This example provides methods for deriving optimal discretization of sequencing data for non-invasive prenatal testing (NIPT). For certain types of molecular diagnostics, copy number variations are detected for a subject. In certain instances, a nucleotide sequencing methodology (e.g., massively parallel sequencing) is applied to genomic material. One parameter of nucleotide sequencing is sequencing depth. Certain types of diagnostics and certain types of copy number variations may require different levels of sequencing depth. Generally, a lower sequencing depth can be used to detect larger copy number variations. In certain instances for which the analyte corresponds to a mixture of genomic material, such as for non-invasive prenatal testing (NIPT), the sequencing depth required to achieve a certain level of detection of copy number variations (CNVs) can be related to the properties of the mixture.

Sequencing data can reflect properties of the original genomic material, and also can be influenced by various steps in the experimental process (e.g., preferential amplification of certain regions as a function of nucleotide composition). Features introduced by an experimental process sometimes are referred to as “bias”. Quantifying these features can be an important step in an overall quantification scheme for the assessment of copy number variations (CNVs). Such features can be correlated with other properties of a reference data set, such as with the nucleotide composition of a reference genome. Depending on the sequencing depth (and, possibly, on the complexity of the analyte), sequencing data might be grouped or aggregated by specific regions from a reference data set (such as regions from a reference genome). The discretization of genomic data (also referred to as “grouping”, “aggregation”, “segmenting”, “partitioning” or “binning”) typically uses regions of predetermined and equal length. In this example, methods are presented for performing more advanced groupings of sequencing data.

Discretization Using Wavelet GC Binning

As noted above, one factor which can influence the ability to determine copy number variations is variability in nucleotide composition as a function of a genomic region of interest. One method for optimal discretization of sequencing data is one which directly accounts for variability in nucleotide composition. Using a reference genome, an average nucleotide composition (such as, e.g., GC content) can be estimated at fixed resolution (e.g., 1 kb size) and then analyzed in the context of an extended genomic region, including an entire genome. In this example, GC content was estimated for a reference genome at a resolution of 1 kb. A transformation of these data can be applied, such as a wavelet transformation with a Haar wavelet basis, using a pre-established level of decomposition. The proper choice of decomposition level can depend on the length of the chromosome L_(chr) and the minimum desired bin size L_(min). The wavelet decomposition level C can be computed by C=log₂(L_(chr)/L_(min)). In certain instances, a decomposition level of C+1 or C−1 also can be used. The minimum bin size can be computed by L_(min)=T_(A)*L_(Genome)/T_(C), where T_(A) is the average count per bin (e.g., 250), L_(Genome) is the length of the genome, and T_(C) is the total number of counts. The wavelet-transformed data, when using this basis, can determine the segmentation of the genomic region by which the sequencing data can be aggregated. In certain instances, a sliding window method, as described herein, can be applied.
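The two formulas above can be illustrated with the following Python sketch. The function names are hypothetical, and rounding C down to an integer is an assumption of the sketch (made so that the finest bins are not shorter than L_(min)); the description above also permits C+1 or C−1.

```python
from math import floor, log2


def minimum_bin_length(avg_count_per_bin, genome_length, total_counts):
    """L_min = T_A * L_Genome / T_C, in the same length units as genome_length."""
    if total_counts <= 0:
        raise ValueError("total_counts must be positive")
    return avg_count_per_bin * genome_length / total_counts


def decomposition_level(chromosome_length, min_bin_length):
    """C = log2(L_chr / L_min), rounded down to an integer (assumption of this sketch)."""
    if chromosome_length < min_bin_length:
        raise ValueError("chromosome is shorter than the minimum bin length")
    return floor(log2(chromosome_length / min_bin_length))


if __name__ == "__main__":
    # Hypothetical run: 15 million counts over a 3 Gb genome, 250 counts per bin on average.
    l_min = minimum_bin_length(250, 3_000_000_000, 15_000_000)
    print(l_min)
    # Hypothetical 78 Mb chromosome.
    print(decomposition_level(78_000_000, l_min))
```

Deeper sequencing (a larger T_(C)) lowers L_(min) and therefore raises the decomposition level, which permits finer bins.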

After aggregating experimental data with respect to the segmentation described above, an optional step is to further filter regions for which the observed read density has an outlier behavior. The merits of such filtering can be judged based on the accuracy of CNV detection estimated from clinical studies.

FIG. 1 illustrates the wavelet GC binning method. The peaks/troughs represent GC content for 1 kb non-overlapping windows on chromosome 18 and the horizontal lines represent the GC content for the wavelet partitioned bins. As shown in FIG. 1, wavelet binning generally does not produce constant size bins (as opposed to 50 kb binning), but captures local GC effects by grouping bins with similar GC content together. FIG. 2 illustrates bin size distribution using the wavelet method, where the majority of bins are 32 kb or 64 kb long.

After wavelet binning, samples can be normalized (e.g., LOESS, bin median, and/or PCA) with one additional step to compensate for the size of the bins. FIG. 3 shows classification results for the wavelet binning method and the 50 kb binning method on LDTv4CE2 data. The classification results were identical. FIG. 4 shows truth tables for chromosomes 21, 18 and 13 for the LDTv4CE2 study.

Discretization Based on Coverage Variability

Another method for optimal discretization of sequencing data can be derived using previously collected sequencing data. Variations in sequencing coverage as a function of genomic location were observed for sequencing results from mixtures of maternal and fetal ccfDNA. This variability may be present even when accounting for alignment artifacts or experimental factors (sometimes referred to as experimental bias).

Copy number variations can be determined by analyzing the density of sequencing data from various genomic regions of interest. According to a general statistical theory of density estimation, one method is to derive a histogram. The number of windows (also known as “breaks” or “bins” or “portions”) in a histogram can vary but there are several theoretical methods which can be used to determine an optimal number of windows. Examples of these methods include Scott's rule or the Freedman-Diaconis rule. Given two genomic regions of interest which span approximately equal sizes (as determined using a reference genome) and are covered, either for a particular sample or on average, by N and, correspondingly, M reads, the ratio of the optimal number of windows covering the first region to the number covering the second region can be estimated to be equal to (M/N)^(1/3). This ratio also can be expressed as (var₁/var₂)^(1/3), where var₁ is coverage variability for region 1 and var₂ is coverage variability for region 2. This ratio can be referred to as a proportionality factor. Using a predetermined value for average sequencing depth and a predetermined value for the complexity of the analyte (such as a value of fetal fraction of interest), a limit of detection (LoD) calculation can be carried out to estimate an average size of CNV which can be detected with a predetermined accuracy. This value can be used to establish a general scale or resolution level for detecting CNVs in samples which have comparable complexity with the value used in the above calculation. Using this value and knowing the extent of a genomic region of interest over which CNVs can be investigated (which can also include an entire genome), an overall total number of bins can be calculated. Under the constraint of keeping this total number of bins constant and using the proportionality factors listed above, the number of windows for various genomic regions of interest can be calculated.
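The allocation step at the end of the preceding paragraph can be sketched in Python as follows. The sketch assumes that a coverage variability value is available for each region, weights each region by the cube root of that value (so that the ratio between any two regions equals (var₁/var₂)^(1/3)), and uses largest-remainder rounding to keep the total number of windows fixed. The function name and the rounding scheme are assumptions, not part of the method described above.

```python
def allocate_windows(coverage_variability, total_windows):
    """Split total_windows across regions in proportion to variability ** (1/3).

    coverage_variability: dict mapping a region name to its coverage variability.
    Returns a dict mapping each region name to an integer number of windows.
    """
    if any(v <= 0 for v in coverage_variability.values()):
        raise ValueError("coverage variability must be positive")
    weights = {r: v ** (1.0 / 3.0) for r, v in coverage_variability.items()}
    weight_sum = sum(weights.values())
    ideal = {r: total_windows * w / weight_sum for r, w in weights.items()}
    counts = {r: int(x) for r, x in ideal.items()}  # round down first
    leftover = total_windows - sum(counts.values())
    # Hand the remaining windows to the regions with the largest fractional parts.
    by_fraction = sorted(ideal, key=lambda r: ideal[r] - counts[r], reverse=True)
    for region in by_fraction[:leftover]:
        counts[region] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical variability values for three regions of similar length.
    print(allocate_windows({"region_1": 8.0, "region_2": 1.0, "region_3": 2.5}, 1000))
```

Regions with a larger variability value receive more windows in this sketch, while the sum over all regions stays equal to the predetermined total.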

The above procedure can be applied to determine an average discretization. It also can be used to calculate a sample specific discretization by taking into account the sample specific analyte complexity, such as fetal fraction. This complexity can be obtained from either other experimental data (e.g., a fetal quantifier assay) or from the sequencing data itself to be discretized. For the latter case, an iterative approach can be taken: starting with a predefined grid, the sample complexity is first estimated (e.g., fetal fraction can be estimated using the BFF algorithm (e.g., portion-specific fetal fraction estimation, as described herein)). A statistical calculation can then be carried out to estimate a confidence interval for this value. Either the point estimate for this value or the confidence interval can then be used to recalculate the optimal discretization as described above.
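The iterative approach above can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the two callables (estimate_fetal_fraction, rebuild_grid) are hypothetical placeholders for a fetal fraction estimator and for the discretization calculation, the stopping rule (absolute change no greater than a tolerance) and the iteration cap are choices made for this sketch, and only the point estimate is used rather than a confidence interval.

```python
from typing import Callable, Tuple, TypeVar

Grid = TypeVar("Grid")


def refine_discretization(
    initial_grid: Grid,
    estimate_fetal_fraction: Callable[[Grid], float],
    rebuild_grid: Callable[[float], Grid],
    tolerance: float = 0.01,
    max_iterations: int = 20,
) -> Tuple[Grid, float, int]:
    """Alternate between estimating fetal fraction and rebuilding the grid.

    Returns the final grid, the final fetal fraction estimate, and the
    number of iterations performed. Both callables are hypothetical and
    must be supplied by the caller.
    """
    grid = initial_grid
    fetal_fraction = estimate_fetal_fraction(grid)
    for iteration in range(1, max_iterations + 1):
        # Recalculate the discretization from the current estimate.
        grid = rebuild_grid(fetal_fraction)
        # Re-estimate the fetal fraction on the new grid.
        new_estimate = estimate_fetal_fraction(grid)
        converged = abs(new_estimate - fetal_fraction) <= tolerance
        fetal_fraction = new_estimate
        if converged:
            return grid, fetal_fraction, iteration
    # Iteration cap reached without meeting the tolerance.
    return grid, fetal_fraction, max_iterations
```

A caller would pass in its own estimator and grid builder; the loop itself does not depend on how either one is implemented.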

In addition to variability in sequencing coverage from one region to another, in certain instances there will be variability in local analyte complexity when considered in the context of genomic regions. For example, in NIPT samples where the analyte is a mixture of maternal and fetal ccfDNA, there can be differences in the fetal fraction associated with some genomic regions when compared to other regions. FIG. 9 shows a LOESS regression plot of normalized counts vs. enet (BFF; portion-specific fetal fraction) bin coefficients (excluding bins with 0 coefficient) for three samples. Sample 1 had low fetal fraction (about 5%), sample 2 had medium fetal fraction (about 10%), and sample 3 had high fetal fraction (about 20%). This plot highlights differences in coverage as a function of enet (BFF; portion-specific fetal fraction) bin coefficients and fetal fraction and supports the observation that certain regions provide more fetal DNA than other regions. Thus, a finer grid, for example, can be constructed for regions providing more fetal DNA, and a coarser grid can be constructed for regions providing less fetal DNA.

The above described discretization method can be further refined by accounting for region-related variability in analyte complexity by repeating the LoD calculation using a local estimate of, e.g., fetal fraction. Sample specific discretization also can be accomplished by first estimating the sequencing depth (using the total number of reads) and then carrying out the calculations described above.

FIG. 10 shows certain steps that can be used in an optimal discretization method, and FIG. 11 shows example workflows for an optimal discretization method using certain steps presented in FIG. 10.
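As a rough illustration of the partitioning steps discussed in this example, the following Python sketch partitions two regions at an initial portion length, compares their coverage variability, and re-partitions them while keeping the total number of portions constant. It is not a reproduction of FIG. 10 or FIG. 11. The function names, the use of population variance of per-portion counts as the measure of coverage variability, and the rounding choices are all assumptions made for this sketch, and the counts in the usage example are invented.

```python
import math
import statistics
from typing import List, Sequence, Tuple

Portion = Tuple[int, int]  # half-open interval [start, end) in base pairs


def partition(start: int, end: int, portion_length: int) -> List[Portion]:
    """Cut [start, end) into portions; the last portion may be shorter."""
    return [(s, min(s + portion_length, end)) for s in range(start, end, portion_length)]


def coverage_variability(counts: Sequence[float]) -> float:
    """Assumed measure of variability: population variance of portion counts."""
    return statistics.pvariance(counts)


def repartition_two_regions(
    region_1: Portion,
    region_2: Portion,
    counts_1: Sequence[float],
    counts_2: Sequence[float],
) -> Tuple[List[Portion], List[Portion]]:
    """Re-partition two similar-sized regions from their initial portion counts."""
    var_1 = coverage_variability(counts_1)
    var_2 = coverage_variability(counts_2)
    if var_1 <= 0 or var_2 <= 0:
        raise ValueError("coverage variability must be positive")

    # Total number of portions is fixed by the initial partition.
    total = len(counts_1) + len(counts_2)

    # Proportionality factor P = (var_1 / var_2) ** (1/3).
    p = (var_1 / var_2) ** (1.0 / 3.0)

    # Split the total so that n_1 / n_2 is approximately P,
    # leaving at least one portion for each region.
    n_1 = min(max(round(total * p / (1.0 + p)), 1), total - 1)
    n_2 = total - n_1

    result = []
    for (start, end), n in ((region_1, n_1), (region_2, n_2)):
        optimized_length = math.ceil((end - start) / n)
        result.append(partition(start, end, optimized_length))
    return result[0], result[1]


if __name__ == "__main__":
    # Two 1 Mb regions, initially cut into 50 kb portions (20 each).
    initial = partition(0, 1_000_000, 50_000)
    counts_a = [100, 106] * 10  # invented counts, higher variability
    counts_b = [100, 102] * 10  # invented counts, lower variability
    new_a, new_b = repartition_two_regions(
        (0, 1_000_000), (1_000_000, 2_000_000), counts_a, counts_b
    )
    print(len(initial), len(new_a), len(new_b))
```

Because the optimized length is rounded up to a whole number of base pairs, the number of portions actually produced for a region can be slightly lower than the target count when regions are short; at genomic scales the difference is usually negligible.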

Example 3: Examples of Embodiments

A1. A method for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; and
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e).

A1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to one or more genomic regions of a reference genome that have been partitioned by a process comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; and
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e).

A2. The method of embodiment A1 or A1.1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.

A3. The method of embodiment A2, wherein the initial portion length in (b) is selected according to sequencing depth for the training set.

A4. The method of embodiment A2 or A3, wherein the initial portion length in (b) is selected according to an average fetal fraction for the training set.

A5. The method of embodiment A4, wherein the average fetal fraction is determined using the training set.

A6. The method of any one of embodiments A1 to A5, wherein the initial portion length is between about 1 kb to about 1000 kb.

A7. The method of any one of embodiments A1 to A6, wherein the initial portion length is about 30 kb.

A8. The method of any one of embodiments A1 to A6, wherein the initial portion length is about 40 kb.

A9. The method of any one of embodiments A1 to A6, wherein the initial portion length is about 50 kb.

A9.1 The method of any one of embodiments A1 to A6, wherein the initial portion length is not 50 kb.

A10. The method of any one of embodiments A1 to A6, wherein the initial portion length is about 60 kb.

A11. The method of any one of embodiments A1 to A6, wherein the initial portion length is about 70 kb.

A11.1 The method of any one of embodiments A1 to A11, wherein a total number of portions for a genome is determined according to the initial portion length in (b).

A12. The method of any one of embodiments A1 to A11.1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region.

A13. The method of embodiment A12, wherein the first genomic region and the second genomic region are substantially similar in size.

A14. The method of embodiment A12 or A13, wherein comparing the sequencing coverage variability in (d) comprises calculating a proportionality factor (P) according to the following equation:

P = (var₁/var₂)^(1/3)    (equation A)

wherein var₁ is the sequencing coverage variability of the first genomic region and var₂ is the sequencing coverage variability of the second genomic region.

A15. The method of embodiment A14, wherein the sequencing coverage variability of the first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region.

A16. The method of embodiment A14, wherein the sequencing coverage variability of the first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region.

A17. The method of embodiment A16, wherein the average nucleotide sequence read counts for each genomic region are determined using the training set.

A18. The method of embodiment A15, wherein the nucleotide sequence read count is a normalized nucleotide sequence read count.

A19. The method of embodiment A16 or A17, wherein the average nucleotide sequence read count is an average normalized nucleotide sequence read count.

A20. The method of any one of embodiments A14 to A19, wherein the recalculating the number of portions for at least one of the genomic regions in (e) is performed according to the proportionality factor and the total number of portions determined from the initial portion length in (b).

A21. The method of any one of embodiments A1 to A20, wherein the plurality of portions in (f) comprises portions of constant size.

A22. The method of any one of embodiments A1 to A20, wherein the plurality of portions in (f) comprises portions of varying size.

A23. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portion lengths of between about 1 kb to about 1000 kb.

A24. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portions of about 30 kb.

A25. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portions of about 40 kb.

A26. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portions of about 50 kb.

A27. The method of embodiment A21 or A22, wherein the plurality of portions in (f) does not comprise 50 kb portions.

A28. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portions of about 60 kb.

A29. The method of embodiment A21 or A22, wherein the plurality of portions in (f) comprises portions of about 70 kb.

A30. The method of any one of embodiments A1 to A29, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

A31. The method of embodiment A30, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

A32. The method of any one of embodiments A1 to A31, comprising mapping nucleotide sequence reads from a test sample to re-partitioned portions of a reference genome, thereby generating mapped nucleotide sequence reads.

A33. The method of embodiment A32, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

A34. The method of embodiment A33, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

A35. The method of embodiment A33 or A34, wherein the normalizing comprises adjusting a sequence read count according to a median count.

A35.1 The method of embodiment A35, wherein the sequence read count is adjusted according to a median portion count.

A36. The method of any one of embodiments A33 to A35.1, wherein the normalizing comprises a principal component normalization.

A37. The method of any one of embodiments A33 to A36, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

A38. The method of any one of embodiments A33 to A37, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

A38.1 The method of any one of embodiments A33 to A38, comprising determining a chromosome structure according to the normalized counts.

A38.2 The method of any one of embodiments A33 to A38.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

A38.3 The method of embodiment A38.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

A38.4 The method of any one of embodiments A38 to A38.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

A39. The method of embodiment A32, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

A40. The method of embodiment A39, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

A41. The method of embodiment A39 or A40, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

A42. The method of embodiment A40 or A41, wherein the raw counts are representative of chromosome dosage for the test sample.

A43. The method of embodiment A42, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

A44. The method of any one of embodiments A40 to A43, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

B1. A method for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus;
-   h) determining a minimum genomic region size; and
-   i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

B1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to one or more genomic regions of a reference genome that have been partitioned by a process comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus;
-   h) determining a minimum genomic region size; and
-   i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

B2. The method of embodiment B1 or B1.1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.

B3. The method of embodiment B2, wherein the initial portion length in (b) is selected according to sequencing depth for the training set.

B4. The method of embodiment B2 or B3, wherein the initial portion length in (b) is selected according to an average fetal fraction for the training set.

B5. The method of embodiment B4, wherein the average fetal fraction is determined using the training set.

B6. The method of any one of embodiments B1 to B5, wherein the initial portion length is between about 1 kb to about 1000 kb.

B7. The method of any one of embodiments B1 to B6, wherein the initial portion length is about 30 kb.

B8. The method of any one of embodiments B1 to B6, wherein the initial portion length is about 40 kb.

B9. The method of any one of embodiments B1 to B6, wherein the initial portion length is about 50 kb.

B9.1 The method of any one of embodiments B1 to B6, wherein the initial portion length is not 50 kb.

B10. The method of any one of embodiments B1 to B6, wherein the initial portion length is about 60 kb.

B11. The method of any one of embodiments B1 to B6, wherein the initial portion length is about 70 kb.

B11.1 The method of any one of embodiments B1 to B11, wherein a total number of portions for a genome is determined according to the initial portion length in (b).

B12. The method of any one of embodiments B1 to B11.1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region.

B13. The method of embodiment B12, wherein the first genomic region and the second genomic region are substantially similar in size.

B14. The method of embodiment B12 or B13, wherein comparing the sequencing coverage variability in (d) comprises calculating a proportionality factor (P) according to the following equation:

P = (var₁/var₂)^(1/3)    (equation A)

wherein var₁ is the sequencing coverage variability of the first genomic region and var₂ is the sequencing coverage variability of the second genomic region.

B15. The method of embodiment B14, wherein the sequencing coverage variability of the first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region.

B16. The method of embodiment B14, wherein the sequencing coverage variability of the first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region.

B17. The method of embodiment B16, wherein the average nucleotide sequence read counts for each genomic region are determined using the training set.

B18. The method of embodiment B15, wherein the nucleotide sequence read count is a normalized nucleotide sequence read count.

B19. The method of embodiment B16 or B17, wherein the average nucleotide sequence read count is an average normalized nucleotide sequence read count.

B20. The method of any one of embodiments B14 to B19, wherein the recalculating the number of portions for at least one of the genomic regions in (e) is performed according to the proportionality factor and the total number of portions determined from the initial portion length in (b).

B21. The method of any one of embodiments B1 to B20, wherein the plurality of portions in (f) comprises portions of constant size.

B22. The method of any one of embodiments B1 to B20, wherein the plurality of portions in (f) comprises portions of varying size.

B23. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portion lengths of between about 1 kb to about 1000 kb.

B24. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portions of about 30 kb.

B25. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portions of about 40 kb.

B26. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portions of about 50 kb.

B27. The method of embodiment B21 or B22, wherein the plurality of portions in (f) does not comprise 50 kb portions.

B28. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portions of about 60 kb.

B29. The method of embodiment B21 or B22, wherein the plurality of portions in (f) comprises portions of about 70 kb.

B30. The method of any one of embodiments B1 to B29, wherein estimating fetal fraction in (g) comprises determining an error value.

B31. The method of any one of embodiments B1 to B30, wherein determining a minimum genomic region size in (h) comprises determining a minimum genomic region size that is detectable for a sample having a fetal fraction as estimated in (g).

B32. The method of embodiment B31, wherein a minimum genomic region size is determined according to an upper 95% confidence interval of fetal fraction.

B33. The method of any one of embodiments B1 to B32, further comprising (j) re-estimating fetal fraction from the refined re-partitioned genomic region.

B34. The method of embodiment B33, comprising comparing the estimated fetal fraction in (g) to the re-estimated fetal fraction in (j).

B35. The method of embodiment B34, comprising repeating parts (g), (h) and (i) when the estimated fetal fraction in (g) differs from the re-estimated fetal fraction in (j) by a predetermined tolerance value.

B35.1 The method of embodiment B35, wherein the predetermined tolerance value is between about 1% to about 25%.

B36. The method of any one of embodiments B1 to B35.1, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

B37. The method of embodiment B36, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

B38. The method of any one of embodiments B1 to B37, comprising mapping nucleotide sequence reads from a test sample to refined re-partitioned portions of a reference genome, thereby generating mapped nucleotide sequence reads.

B39. The method of embodiment B38, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

B40. The method of embodiment B39, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

B41. The method of embodiment B39 or B40, wherein the normalizing comprises adjusting a sequence read count according to a median count.

B41.1 The method of embodiment B41, wherein the sequence read count is adjusted according to a median portion count.

B42. The method of any one of embodiments B39 to B41.1, wherein the normalizing comprises a principal component normalization.

B43. The method of any one of embodiments B39 to B42, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

B44. The method of any one of embodiments B39 to B43, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

B44.1 The method of any one of embodiments B39 to B44, comprising determining a chromosome structure according to the normalized counts.

B44.2 The method of any one of embodiments B39 to B44.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

B44.3 The method of embodiment B44.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

B44.4 The method of any one of embodiments B44 to B44.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

B45. The method of embodiment B38, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

B46. The method of embodiment B45, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

B47. The method of embodiment B45 or B46, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

B48. The method of embodiment B46 or B47, wherein the raw counts are representative of chromosome dosage for the test sample.

B49. The method of embodiment B48, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

B50. The method of any one of embodiments B46 to B49, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

C1. A method for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   h) determining a local minimum genomic region size; and
-   i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

C1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to one or more genomic regions of a reference genome that have been partitioned by a process comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   h) determining a local minimum genomic region size; and
-   i) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

C2. The method of embodiment C1 or C1.1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.

C3. The method of embodiment C2, wherein the initial portion length in (b) is selected according to sequencing depth for the training set.

C4. The method of embodiment C2 or C3, wherein the initial portion length in (b) is selected according to an average fetal fraction for the training set.

C5. The method of embodiment C4, wherein the average fetal fraction is determined using the training set.

C6. The method of any one of embodiments C1 to C5, wherein the initial portion length is between about 1 kb to about 1000 kb.

C7. The method of any one of embodiments C1 to C6, wherein the initial portion length is about 30 kb.

C8. The method of any one of embodiments C1 to C6, wherein the initial portion length is about 40 kb.

C9. The method of any one of embodiments C1 to C6, wherein the initial portion length is about 50 kb.

C9.1 The method of any one of embodiments C1 to C6, wherein the initial portion length is not 50 kb.

C10. The method of any one of embodiments C1 to C6, wherein the initial portion length is about 60 kb.

C11. The method of any one of embodiments C1 to C6, wherein the initial portion length is about 70 kb.

C11.1 The method of any one of embodiments C1 to C11, wherein a total number of portions for a genome is determined according to the initial portion length in (b).

C12. The method of any one of embodiments C1 to C11.1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region.

C13. The method of embodiment C12, wherein the first genomic region and the second genomic region are substantially similar in size.

C14. The method of embodiment C12 or C13, wherein comparing the sequencing coverage variability in (d) comprises calculating a proportionality factor (P) according to the following equation:

P = (var₁/var₂)^(1/3)    (equation A)

wherein var₁ is the sequencing coverage variability of the first genomic region and var₂ is the sequencing coverage variability of the second genomic region.

C15. The method of embodiment C14, wherein the sequencing coverage variability of the first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region.

C16. The method of embodiment C14, wherein the sequencing coverage variability of the first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region.

C17. The method of embodiment C16, wherein the average nucleotide sequence read counts for each genomic region are determined using the training set.

C18. The method of embodiment C15, wherein the nucleotide sequence read count is a normalized nucleotide sequence read count.

C19. The method of embodiment C16 or C17, wherein the average nucleotide sequence read count is an average normalized nucleotide sequence read count.

C20. The method of any one of embodiments C14 to C19, wherein the recalculating the number of portions for at least one of the genomic regions in (e) is performed according to the proportionality factor and the total number of portions determined from the initial portion length in (b).

C21. The method of any one of embodiments C1 to C20, wherein the plurality of portions in (f) comprises portions of constant size.

C22. The method of any one of embodiments C1 to C20, wherein the plurality of portions in (f) comprises portions of varying size.

C23. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portion lengths of between about 1 kb and about 1000 kb.

C24. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portions of about 30 kb.

C25. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portions of about 40 kb.

C26. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portions of about 50 kb.

C27. The method of embodiment C21 or C22, wherein the plurality of portions in (f) does not comprise 50 kb portions.

C28. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portions of about 60 kb.

C29. The method of embodiment C21 or C22, wherein the plurality of portions in (f) comprises portions of about 70 kb.

C30. The method of any one of embodiments C1 to C29, wherein determining a local minimum genomic region size in (h) comprises determining a local genomic region size that is detectable for a sample having an average fetal fraction.

C31. The method of any one of embodiments C1 to C30, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

C32. The method of embodiment C31, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

C33. The method of any one of embodiments C1 to C32, comprising mapping nucleotide sequence reads from a test sample to refined re-partitioned portions of a reference genome, thereby generating mapped nucleotide sequence reads.

C34. The method of embodiment C33, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

C35. The method of embodiment C34, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

C36. The method of embodiment C34 or C35, wherein the normalizing comprises adjusting a sequence read count according to a median count.

C36.1 The method of embodiment C36, wherein the sequence read count is adjusted according to a median portion count.
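One simple reading of adjusting read counts according to a median portion count is to divide each portion's count by the median count across portions. The sketch below illustrates that reading only; the embodiments do not specify the exact adjustment, and the function name and example counts are hypothetical.

```python
from statistics import median


def normalize_by_median_portion_count(counts):
    """Scale per-portion read counts by the median portion count.

    counts: list of read counts, one per portion (hypothetical input).
    Division by the median is an assumption of this sketch; other
    adjustments (for example subtraction) would also fit the wording.
    """
    m = median(counts)
    if m == 0:
        raise ValueError("median portion count is zero; cannot normalize")
    return [c / m for c in counts]


# Hypothetical counts for three portions; the median portion maps to 1.0.
normalized = normalize_by_median_portion_count([10, 20, 30])
```

After this step a typical portion has a value near 1, which makes counts from samples sequenced at different depths easier to compare.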

C37. The method of any one of embodiments C34 to C36.1, wherein the normalizing comprises a principal component normalization.

C38. The method of any one of embodiments C34 to C37, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

C39. The method of any one of embodiments C34 to C38, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

C39.1 The method of any one of embodiments C34 to C39, comprising determining a chromosome structure according to the normalized counts.

C39.2 The method of any one of embodiments C34 to C39.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

C39.3 The method of embodiment C39.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

C39.4 The method of any one of embodiments C39 to C39.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

C40. The method of embodiment C33, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

C41. The method of embodiment C40, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

C42. The method of embodiment C40 or C41, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

C43. The method of embodiment C41 or C42, wherein the raw counts are representative of chromosome dosage for the test sample.

C44. The method of embodiment C43, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

C45. The method of any one of embodiments C41 to C44, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

D1. A method for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus;
-   h) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   i) determining a local minimum genomic region size; and
-   j) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

D1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to one or more genomic regions of a reference genome that have been partitioned by a process comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison;
-   e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length;
-   f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating a re-partitioned genomic region;
-   g) estimating fetal fraction for a test sample from a pregnant female bearing a fetus;
-   h) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   i) determining a local minimum genomic region size; and
-   j) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a refined re-partitioned genomic region.

D2. The method of embodiment D1 or D1.1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.

D3. The method of embodiment D2, wherein the initial portion length in (b) is selected according to sequencing depth for the training set.

D4. The method of embodiment D2 or D3, wherein the initial portion length in (b) is selected according to an average fetal fraction for the training set.

D5. The method of embodiment D4, wherein the average fetal fraction is determined using the training set.

D6. The method of any one of embodiments D1 to D5, wherein the initial portion length is between about 1 kb and about 1000 kb.

D7. The method of any one of embodiments D1 to D6, wherein the initial portion length is about 30 kb.

D8. The method of any one of embodiments D1 to D6, wherein the initial portion length is about 40 kb.

D9. The method of any one of embodiments D1 to D6, wherein the initial portion length is about 50 kb.

D9.1 The method of any one of embodiments D1 to D6, wherein the initial portion length is not 50 kb.

D10. The method of any one of embodiments D1 to D6, wherein the initial portion length is about 60 kb.

D11. The method of any one of embodiments D1 to D6, wherein the initial portion length is about 70 kb.

D11.1 The method of any one of embodiments D1 to D11, wherein a total number of portions for a genome is determined according to the initial portion length in (b).
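To make the relationship between the initial portion length and the total number of portions concrete, the sketch below partitions regions into fixed-length portions and counts them. It assumes half-open coordinates and a shorter final portion when a region length is not a multiple of the portion length; the function names and region lengths are hypothetical.

```python
import math


def partition_region(region_length: int, portion_length: int):
    """Split a region into consecutive (start, end) portions.

    Coordinates are half-open and zero-based (an assumption of this
    sketch). The last portion may be shorter than portion_length.
    """
    if region_length <= 0 or portion_length <= 0:
        raise ValueError("lengths must be positive")
    return [
        (start, min(start + portion_length, region_length))
        for start in range(0, region_length, portion_length)
    ]


def total_portions(region_lengths, portion_length: int) -> int:
    """Total number of portions implied by one initial portion length."""
    return sum(math.ceil(length / portion_length) for length in region_lengths)


# Hypothetical regions of 120 kb and 100 kb with a 50 kb initial portion length.
portions = partition_region(120_000, 50_000)
n_total = total_portions([120_000, 100_000], 50_000)
```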

D12. The method of any one of embodiments D1 to D11.1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region.

D13. The method of embodiment D12, wherein the first genomic region and the second genomic region are substantially similar in size.

D14. The method of embodiment D12 or D13, wherein comparing the sequencing coverage variability in (d) comprises calculating a proportionality factor (P) according to the following equation:

P = (var₁/var₂)^(1/3)     (equation A)

wherein var₁ is the sequencing coverage variability of the first genomic region and var₂ is the sequencing coverage variability of the second genomic region.

D15. The method of embodiment D14, wherein the sequencing coverage variability of the first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region.

D16. The method of embodiment D14, wherein the sequencing coverage variability of the first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region.

D17. The method of embodiment D16, wherein the average nucleotide sequencing read counts for each genomic region are determined using the training set.

D18. The method of embodiment D15, wherein the nucleotide sequence read count is a normalized nucleotide sequence read count.

D19. The method of embodiment D16 or D17, wherein the average nucleotide sequence read count is an average normalized nucleotide sequence read count.

D20. The method of any one of embodiments D14 to D19, wherein recalculating the number of portions for at least one of the genomic regions in (e) is performed according to the proportionality factor and the total number of portions determined from the initial portion length in (b).

D21. The method of any one of embodiments D1 to D20, wherein the plurality of portions in (f) comprises portions of constant size.

D22. The method of any one of embodiments D1 to D20, wherein the plurality of portions in (f) comprises portions of varying size.

D23. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portion lengths of between about 1 kb and about 1000 kb.

D24. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portions of about 30 kb.

D25. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portions of about 40 kb.

D26. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portions of about 50 kb.

D27. The method of embodiment D21 or D22, wherein the plurality of portions in (f) does not comprise 50 kb portions.

D28. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portions of about 60 kb.

D29. The method of embodiment D21 or D22, wherein the plurality of portions in (f) comprises portions of about 70 kb.

D30. The method of any one of embodiments D1 to D29, wherein estimating fetal fraction in (g) comprises determining an error value.

D31. The method of any one of embodiments D1 to D30, wherein determining a local minimum genomic region size in (i) comprises determining a minimum local genomic region size that is detectable for a sample having a fetal fraction as estimated in (g).

D32. The method of embodiment D31, wherein a local minimum genomic region size is determined according to an upper 95% confidence interval of fetal fraction.

D33. The method of any one of embodiments D1 to D32, further comprising (k) re-estimating fetal fraction from the refined re-partitioned genomic region.

D34. The method of embodiment D33, comprising comparing the estimated fetal fraction in (g) to the re-estimated fetal fraction in (k).

D35. The method of embodiment D34, comprising repeating parts (g), (h), (i) and (j) when the estimated fetal fraction in (g) differs from the re-estimated fetal fraction in (k) by a predetermined tolerance value.

D35.1 The method of embodiment D35, wherein the predetermined tolerance value is between about 1% and about 25%.
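The repeat-until-stable logic of embodiments D33 to D35.1 can be sketched as a loop that stops once the re-estimated fetal fraction agrees with the previous estimate to within the tolerance. This is a minimal sketch under stated assumptions: the tolerance is treated as a relative difference, the estimation and refinement steps are passed in as hypothetical callables, and a round limit is added so the loop always terminates.

```python
def refine_until_stable(estimate_ff, refine_partition, partition,
                        tolerance=0.05, max_rounds=10):
    """Repeat refinement until the fetal fraction estimate stabilizes.

    estimate_ff(partition) -> float      stands in for steps (g) and (k)
    refine_partition(partition, ff)      stands in for steps (h) to (j)
    tolerance: relative difference (0.05 means 5%); treating the
    tolerance as relative is an assumption of this sketch.
    """
    ff = estimate_ff(partition)
    for _ in range(max_rounds):
        partition = refine_partition(partition, ff)
        new_ff = estimate_ff(partition)
        if abs(new_ff - ff) <= tolerance * ff:
            return partition, new_ff  # estimates agree within tolerance
        ff = new_ff
    return partition, ff  # round limit reached without agreement


# Toy stand-ins: the "partition" is just a portion count that doubles on
# each refinement, and the estimate approaches 0.10 as the count grows.
final_partition, final_ff = refine_until_stable(
    estimate_ff=lambda n: 0.10 + 0.08 / n,
    refine_partition=lambda n, ff: n * 2,
    partition=1,
)
```

The toy callables exist only to exercise the loop; real estimation and refinement steps would operate on read counts and genomic coordinates.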

D36. The method of any one of embodiments D1 to D35.1, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

D37. The method of embodiment D36, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

D38. The method of any one of embodiments D1 to D37, comprising mapping nucleotide sequence reads from a test sample to refined re-partitioned portions of a reference genome, thereby generating mapped nucleotide sequence reads.

D39. The method of embodiment D38, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

D40. The method of embodiment D39, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

D41. The method of embodiment D39 or D40, wherein the normalizing comprises adjusting a sequence read count according to a median count.

D41.1 The method of embodiment D41, wherein the sequence read count is adjusted according to a median portion count.

D42. The method of any one of embodiments D39 to D41.1, wherein the normalizing comprises a principal component normalization.

D43. The method of any one of embodiments D39 to D42, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

D44. The method of any one of embodiments D39 to D43, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

D44.1 The method of any one of embodiments D39 to D44, comprising determining a chromosome structure according to the normalized counts.

D44.2 The method of any one of embodiments D39 to D44.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

D44.3 The method of embodiment D44.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

D44.4 The method of any one of embodiments D44 to D44.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

D45. The method of embodiment D38, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

D46. The method of embodiment D45, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

D47. The method of embodiment D45 or D46, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

D48. The method of embodiment D46 or D47, wherein the raw counts are representative of chromosome dosage for the test sample.

D49. The method of embodiment D48, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

D50. The method of any one of embodiments D46 to D49, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

E1. A method for partitioning one or more genomic regions of a reference genome into a plurality of portions comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   e) determining a local minimum genomic region size; and
-   f) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a re-partitioned genomic region.

E1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to one or more genomic regions of a reference genome that have been partitioned by a process comprising:

-   a) determining sequencing coverage variability across a reference genome;
-   b) selecting an initial portion length;
-   c) partitioning at least two genomic regions according to the initial portion length in (b);
-   d) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor;
-   e) determining a local minimum genomic region size; and
-   f) adjusting the number of portions for each genomic region to comprise at least two portions, thereby generating a re-partitioned genomic region.

E2. The method of embodiment E1 or E1.1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.

E3. The method of embodiment E1 or E2, wherein the initial portion length in (b) is selected according to sequencing depth.

E4. The method of embodiment E1, E2 or E3, wherein the initial portion length in (b) is selected according to an average fetal fraction.

E5. The method of embodiment E4, wherein the average fetal fraction is determined using the training set.

E6. The method of any one of embodiments E1 to E5, wherein the initial portion length is between about 1 kb and about 1000 kb.

E7. The method of any one of embodiments E1 to E6, wherein the initial portion length is about 30 kb.

E8. The method of any one of embodiments E1 to E6, wherein the initial portion length is about 40 kb.

E9. The method of any one of embodiments E1 to E6, wherein the initial portion length is about 50 kb.

E9.1 The method of any one of embodiments E1 to E6, wherein the initial portion length is not 50 kb.

E10. The method of any one of embodiments E1 to E6, wherein the initial portion length is about 60 kb.

E11. The method of any one of embodiments E1 to E6, wherein the initial portion length is about 70 kb.

E11.1 The method of any one of embodiments E1 to E11, wherein a total number of portions for a genome is determined according to the initial portion length in (b).

E12. The method of any one of embodiments E1 to E11.1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region.

E13. The method of embodiment E12, wherein the first genomic region and the second genomic region are substantially similar in size.

E14. The method of any one of embodiments E1 to E13, wherein the re-partitioned genomic region in (f) comprises portions of constant size.

E15. The method of any one of embodiments E1 to E13, wherein the re-partitioned genomic region in (f) comprises portions of varying size.

E16. The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of between about 1 kb and about 1000 kb.

E17. The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of about 30 kb.

E18. The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of about 40 kb.

E18.1 The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of about 50 kb.

E19. The method of embodiment E14 or E15, wherein the re-partitioned genomic region does not comprise 50 kb portions.

E20. The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of about 60 kb.

E21. The method of embodiment E14 or E15, wherein the re-partitioned genomic region in (f) comprises portions having sizes of about 70 kb.

E22. The method of any one of embodiments E1 to E21, wherein determining a local minimum genomic region size in (e) comprises identifying a local genomic region size that is detectable for a sample having an average fetal fraction.

E23. The method of any one of embodiments E1 to E22, further comprising (g) re-estimating fetal fraction from the re-partitioned genomic region.

E24. The method of embodiment E23, comprising comparing the region-specific fetal fraction in (d) to the re-estimated fetal fraction in (g).

E25. The method of embodiment E24, comprising repeating parts (d), (e) and (f) when the region-specific fetal fraction in (d) differs from the re-estimated fetal fraction in (g) by a predetermined tolerance value.

E25.1 The method of embodiment E25, wherein the predetermined tolerance value is between about 1% and about 25%.

E26. The method of any one of embodiments E1 to E25.1, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

E27. The method of embodiment E26, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

E28. The method of any one of embodiments E1 to E27, comprising mapping nucleotide sequence reads from a test sample to re-partitioned portions of a reference genome, thereby generating mapped nucleotide sequence reads.

E29. The method of embodiment E28, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

E30. The method of embodiment E29, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

E31. The method of embodiment E29 or E30, wherein the normalizing comprises adjusting a sequence read count according to a median count.

E31.1 The method of embodiment E31, wherein the sequence read count is adjusted according to a median portion count.

E32. The method of any one of embodiments E29 to E31.1, wherein the normalizing comprises a principal component normalization.

E33. The method of any one of embodiments E29 to E32, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

E34. The method of any one of embodiments E29 to E33, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

E34.1 The method of any one of embodiments E29 to E34, comprising determining a chromosome structure according to the normalized counts.

E34.2 The method of any one of embodiments E29 to E34.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

E34.3 The method of embodiment E34.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

E34.4 The method of any one of embodiments E34 to E34.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

E35. The method of embodiment E28, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

E36. The method of embodiment E35, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

E37. The method of embodiment E35 or E36, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

E38. The method of embodiment E36 or E37, wherein the raw counts are representative of chromosome dosage for the test sample.

E39. The method of embodiment E38, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

E40. The method of any one of embodiments E36 to E39, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

F1. A method for partitioning a reference genome, or part thereof, into a plurality of portions comprising:

-   a) generating a guanine and cytosine (GC) profile for a reference genome, or part thereof;
-   b) applying a segmenting process to the GC profile generated in (a), thereby providing discrete segments; and
-   c) partitioning the reference genome, or part thereof, into a plurality of portions according to the discrete segments provided in (b), thereby generating a GC partitioned reference genome, or part thereof.

F1.1 A method for identifying the presence or absence of a genetic variation comprising quantifying nucleotide sequence reads for a test sample, which sequence reads are mapped to a reference genome, or part thereof, that has been partitioned by a process comprising:

-   a) generating a guanine and cytosine (GC) profile for a reference genome, or part thereof;
-   b) applying a segmenting process to the GC profile generated in (a), thereby providing discrete segments; and
-   c) partitioning the reference genome, or part thereof, into a plurality of portions according to the discrete segments provided in (b), thereby generating a GC partitioned reference genome, or part thereof.

F1.2 The method of embodiment F1 or F1.1, comprising partitioning a chromosome, or segment of a chromosome, from the reference genome, thereby generating a GC partitioned chromosome, or GC partitioned chromosome segment.

F2. The method of embodiment F1, F1.1 or F1.2, wherein the GC profile in (a) comprises GC content levels determined for each 1 kb of nucleotide sequence in the reference genome.
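A GC profile of the kind described in step (a) and embodiment F2 can be sketched as the fraction of G and C bases in each consecutive window of the sequence. The sketch below assumes non-overlapping windows and ignores ambiguous bases such as N when computing the fraction; the function name and these handling choices are assumptions, not details from the embodiments.

```python
def gc_profile(sequence: str, window: int = 1000):
    """Return the GC fraction of each consecutive window of `sequence`.

    window: window size in bases (1000 corresponds to 1 kb).
    Bases other than A, C, G, T are ignored (an assumption of this
    sketch); a window with no called bases yields NaN.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    profile = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        called = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        profile.append(gc / called if called else float("nan"))
    return profile


# Tiny hypothetical sequence with a 4-base window for readability.
profile = gc_profile("GGCCATATGCAT", window=4)
```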

F3. The method of embodiment F2, wherein the segmenting process in (b) is performed on the GC content levels.

F4. The method of embodiment F3, wherein 1 kb nucleotide sequences having similar GC content levels are merged into the discrete segments.

F5. The method of any one of embodiments F1 to F4, wherein the segmenting process in (b) generates a decomposition rendering comprising the discrete segments.

F5.1 The method of any one of embodiments F1 to F5, wherein the segmenting process in (b) is performed according to a level of decomposition based on chromosome length (L_chr) and minimum portion length (L_min).

F6. The method of any one of embodiments F1 to F5.1, wherein the segmenting process in (b) comprises Haar wavelet segmentation.
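One way to picture Haar wavelet segmentation of a GC profile is: decompose the profile into pairwise averages and differences, zero out small differences, reconstruct, and treat each run of identical reconstructed values as one discrete segment. The sketch below follows that generic recipe with an unnormalized Haar transform. The threshold, the requirement that the profile length be divisible by 2 to the power of the level, and all names are assumptions of this sketch, and the decomposition level is left as a free parameter rather than derived from L_chr and L_min.

```python
def haar_segment(profile, levels, threshold):
    """Segment a profile by thresholded Haar decomposition.

    Returns (start, end, level_value) tuples with half-open indices into
    `profile` (for a 1 kb GC profile, indices are in kb).
    """
    n = len(profile)
    if levels < 0 or n == 0 or n % (2 ** levels) != 0:
        raise ValueError("profile length must be a positive multiple of 2**levels")

    # Forward transform: keep pairwise averages, store pairwise differences.
    approx = list(profile)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]

    # Inverse transform, dropping differences smaller than the threshold.
    for detail in reversed(details):
        rebuilt = []
        for avg, diff in zip(approx, detail):
            if abs(diff) < threshold:
                diff = 0.0
            rebuilt.extend([avg + diff, avg - diff])
        approx = rebuilt

    # Merge runs of identical reconstructed values into segments.
    segments = []
    start = 0
    for i in range(1, n + 1):
        if i == n or approx[i] != approx[start]:
            segments.append((start, i, approx[start]))
            start = i
    return segments


# Hypothetical noisy profile with a GC shift halfway through.
segments = haar_segment(
    [0.41, 0.39, 0.40, 0.40, 0.60, 0.61, 0.59, 0.60], levels=2, threshold=0.05
)
boundaries = [(s, e) for s, e, _ in segments]
```

In this construction, windows merge in blocks whose lengths are multiples of powers of two, so a higher decomposition level allows longer merged segments; how closely this matches the segmenting process of the embodiments is not established here.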

F7. The method of any one of embodiments F1 to F6, wherein the plurality of portions comprises portions of varying size.

F8. The method of embodiment F7, wherein the plurality of portions comprises portions having sizes between about 30 kb and about 300 kb.

F9. The method of embodiment F7, wherein the plurality of portions comprises portions of about 32 kb.

F10. The method of embodiment F7, wherein the plurality of portions comprises portions of about 64 kb.

F11. The method of embodiment F7, wherein the plurality of portions comprises portions of about 128 kb.

F12. The method of embodiment F7, wherein the plurality of portions comprises portions of about 256 kb.

F13. The method of embodiment F7, wherein the plurality of portions does not comprise 50 kb portions.

F14. The method of any one of embodiments F1 to F13, comprising determining a GC content for the discrete segments in (b).

F15. The method of any one of embodiments F1 to F14, comprising sequencing nucleic acid from a test sample by a nucleotide sequencing process to generate nucleotide sequence reads.

F16. The method of embodiment F15, wherein the nucleic acid is circulating cell-free nucleic acid from a pregnant female bearing a fetus.

F17. The method of any one of embodiments F1 to F16, comprising mapping nucleotide sequence reads from a test sample to portions of a GC partitioned reference genome, thereby generating mapped nucleotide sequence reads.

F18. The method of embodiment F17, comprising normalizing counts of the mapped nucleotide sequence reads, thereby generating normalized counts.

F19. The method of embodiment F18, wherein the normalizing comprises LOESS normalization of guanine and cytosine (GC) bias (GC-LOESS normalization).

F20. The method of embodiment F17 or F18, wherein the normalizing comprises adjusting a sequence read count according to a median count.

F20.1 The method of embodiment F20, wherein the sequence read count is adjusted according to a median portion count.

F21. The method of any one of embodiments F17 to F20.1, wherein the normalizing comprises a principal component normalization.

F22. The method of any one of embodiments F18 to F21, wherein the normalizing comprises GC-LOESS normalization followed by normalization according to a median portion count followed by a principal component normalization.

F23. The method of any one of embodiments F18 to F22, comprising determining the presence or absence of a genetic variation for the test sample according to the normalized counts.

F23.1 The method of any one of embodiments F18 to F23, comprising determining a chromosome structure according to the normalized counts.

F23.2 The method of any one of embodiments F18 to F23.1, wherein the normalized counts are representative of chromosome dosage for the test sample.

F23.3 The method of embodiment F23.2, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

F23.4 The method of any one of embodiments F23 to F23.3, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of a chromosome.

F24. The method of embodiment F17, wherein the method does not comprise normalizing counts of the mapped nucleotide sequence reads.

F25. The method of embodiment F24, comprising determining the presence or absence of a genetic variation for the test sample according to raw counts of the mapped nucleotide sequence reads.

F26. The method of embodiment F24 or F25, comprising determining a chromosome structure according to raw counts of the mapped nucleotide sequence reads.

F27. The method of embodiment F25 or F26, wherein the raw counts are representative of chromosome dosage for the test sample.

F28. The method of embodiment F27, wherein determining a presence or absence of a genetic variation is according to the chromosome dosage.

F29. The method of any one of embodiments F25 to F28, wherein determining the presence or absence of a genetic variation for the test sample comprises identifying the presence or absence of one copy of a chromosome, two copies of a chromosome, three copies of a chromosome, four copies of a chromosome, five copies of a chromosome, a deletion of one or more segments of a chromosome or an insertion of one or more segments of chromosome.
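
As a non-limiting illustration of embodiments F20.1 and F23.2, the following Python sketch adjusts portion counts according to a median portion count and then summarizes the normalized counts for one chromosome. The function names, the toy counts, and the use of a simple share of total normalized counts as a stand-in for chromosome dosage are assumptions made for illustration only; GC-LOESS normalization and principal component normalization are not shown.

```python
from statistics import median


def median_normalize(portion_counts):
    """Divide each portion count by the sample's median portion count."""
    m = median(portion_counts)
    if m == 0:
        raise ValueError("median portion count is zero")
    return [c / m for c in portion_counts]


def chromosome_share(normalized, portion_chroms, target):
    """Share of all normalized counts that fall in portions of `target`.

    Assumption: this share is used here as a simple proxy for chromosome
    dosage; the embodiments do not prescribe this particular statistic.
    """
    total = sum(normalized)
    on_target = sum(v for v, ch in zip(normalized, portion_chroms) if ch == target)
    return on_target / total


# Hypothetical toy data: six portions on two chromosomes.
counts = [100, 120, 80, 100, 150, 50]
chroms = ["chr1", "chr1", "chr1", "chr1", "chr21", "chr21"]

normalized = median_normalize(counts)
share_chr21 = chromosome_share(normalized, chroms, "chr21")
```

In this toy sample the median portion count is 100, so each count is simply divided by 100 before the chromosome 21 share is computed.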

The entirety of each patent, patent application, publication and document referenced herein hereby is incorporated by reference. Citation of the above patents, patent applications, publications and documents is not an admission that any of the foregoing is pertinent prior art, nor does it constitute any admission as to the contents or date of these publications or documents.

Modifications may be made to the foregoing without departing from the basic aspects of the technology. Although the technology has been described in substantial detail with reference to one or more specific embodiments, those of ordinary skill in the art will recognize that changes may be made to the embodiments specifically disclosed in this application, yet these modifications and improvements are within the scope and spirit of the technology.

The technology illustratively described herein suitably may be practiced in the absence of any element(s) not specifically disclosed herein. Thus, for example, in each instance herein any of the terms “comprising,” “consisting essentially of,” and “consisting of” may be replaced with either of the other two terms. The terms and expressions which have been employed are used as terms of description and not of limitation, and use of such terms and expressions does not exclude any equivalents of the features shown and described or portions thereof, and various modifications are possible within the scope of the technology claimed. The term “a” or “an” can refer to one of or a plurality of the elements it modifies (e.g., “a reagent” can mean one or more reagents) unless it is contextually clear either one of the elements or more than one of the elements is described. The term “about” as used herein refers to a value within 10% of the underlying parameter (i.e., plus or minus 10%), and use of the term “about” at the beginning of a string of values modifies each of the values (i.e., “about 1, 2 and 3” refers to about 1, about 2 and about 3). For example, a weight of “about 100 grams” can include weights between 90 grams and 110 grams. Further, when a listing of values is described herein (e.g., about 50%, 60%, 70%, 80%, 85% or 86%) the listing includes all intermediate and fractional values thereof (e.g., 54%, 85.4%). Thus, it should be understood that although the present technology has been specifically disclosed by representative embodiments and optional features, modification and variation of the concepts herein disclosed may be resorted to by those skilled in the art, and such modifications and variations are considered within the scope of this technology.
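
The plus or minus 10% meaning of “about” given above can be checked with a short Python sketch. The function names are hypothetical and serve only to illustrate the arithmetic, including the rule that “about” at the beginning of a string of values modifies each value.

```python
def about_range(nominal, tolerance=0.10):
    """Return the (low, high) bounds that fall within 'about' the nominal value."""
    delta = tolerance * abs(nominal)
    return (nominal - delta, nominal + delta)


def is_about(value, nominal, tolerance=0.10):
    """True if `value` lies within plus or minus 10% of `nominal`."""
    low, high = about_range(nominal, tolerance)
    return low <= value <= high


def about_each(nominals, tolerance=0.10):
    """'about 1, 2 and 3' is read as about 1, about 2 and about 3."""
    return [about_range(n, tolerance) for n in nominals]


grams_low, grams_high = about_range(100)   # the "about 100 grams" example
ranges = about_each([1, 2, 3])
```

Under this reading, a portion of “about 128 kb” would span roughly 115.2 kb to 140.8 kb.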

Certain embodiments of the technology are set forth in the claim(s) that follow(s).

What is claimed is:
 1. A method for identifying a presence or absence of a genetic variation, comprising: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating at least one re-partitioned genomic region; g) mapping nucleotide sequence reads from a test sample to the plurality of portions of the at least one re-partitioned genomic region, thereby generating mapped nucleotide sequence reads; and h) determining the presence or absence of the genetic variation for the test sample according to raw counts or normalized counts of the mapped nucleotide sequence reads.
 2. The method of claim 1, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.
 3. The method of claim 2, wherein the initial portion length in (b) is selected according to sequencing depth for the training set.
 4. The method of claim 2, wherein the initial portion length in (b) is selected according to an average fetal fraction for the training set.
 5. The method of claim 4, wherein the average fetal fraction is determined using the training set.
 6. The method of claim 1, wherein the initial portion length is between about 1 kb to about 1000 kb.
 7. The method of claim 1, wherein a total number of portions for a genome is determined according to the initial portion length in (b).
 8. The method of claim 1, wherein the at least two genomic regions comprise a first genomic region and a second genomic region, and wherein the first genomic region and the second genomic region are substantially similar in size.
 9. The method of claim 8, wherein the sequencing coverage variability of the first genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from a nucleotide sequence read count, or a derivative thereof, for the second genomic region.
 10. The method of claim 8, wherein the sequencing coverage variability of the first genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the first genomic region, and wherein sequencing coverage variability of the second genomic region is determined from an average nucleotide sequence read count, or a derivative thereof, for the second genomic region.
 11. The method of claim 1, further comprising: i) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; j) determining a local minimum genomic region size based on the region-specific fetal fraction; and k) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
 12. The method of claim 1, further comprising: i) estimating a fetal fraction for the test sample from a pregnant female bearing a fetus; j) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; k) determining a local minimum genomic region size based on the region-specific fetal fraction; and l) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
 13. A system for identifying a presence or absence of a genetic variation, the system comprising: one or more data processors; and a non-transitory computer readable storage medium containing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform actions including: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating at least one re-partitioned genomic region; g) mapping nucleotide sequence reads from a test sample to the plurality of portions of the at least one re-partitioned genomic region, thereby generating mapped nucleotide sequence reads; and h) determining the presence or absence of the genetic variation for the test sample according to raw counts or normalized counts of the mapped nucleotide sequence reads.
 14. The system of claim 13, wherein determining the sequencing coverage variability in (a) comprises use of a training set of nucleotide sequence reads mapped to portions of a reference genome, which sequence reads are reads of circulating cell-free nucleic acid from a plurality of samples from pregnant females bearing a fetus.
 15. The system of claim 14, wherein the initial portion length in (b) is selected according to sequencing depth for the training set or an average fetal fraction for the training set.
 16. The system of claim 13, wherein the actions further include: i) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; j) determining a local minimum genomic region size based on the region-specific fetal fraction; and k) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
 17. The system of claim 13, wherein the actions further include: i) estimating a fetal fraction for the test sample from a pregnant female bearing a fetus; j) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; k) determining a local minimum genomic region size based on the region-specific fetal fraction; and l) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
 18. A computer-program product tangibly embodied in a non-transitory machine-readable storage medium, including instructions configured to cause one or more data processors to perform actions including: a) determining sequencing coverage variability across a reference genome; b) selecting an initial portion length; c) partitioning at least two genomic regions according to the initial portion length in (b); d) comparing the sequencing coverage variability determined in (a) for each of the at least two genomic regions, thereby generating a comparison; e) recalculating the number of portions for at least one of the genomic regions according to the comparison in (d), thereby determining an optimized portion length; f) re-partitioning at least one of the genomic regions into a plurality of portions according to the optimized portion length in (e), thereby generating at least one re-partitioned genomic region; g) mapping nucleotide sequence reads from a test sample to the plurality of portions of the at least one re-partitioned genomic region, thereby generating mapped nucleotide sequence reads; and h) determining the presence or absence of the genetic variation for the test sample according to raw counts or normalized counts of the mapped nucleotide sequence reads.
 19. The computer-program product of claim 18, wherein the actions further include: i) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; j) determining a local minimum genomic region size based on the region-specific fetal fraction; and k) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
 20. The computer-program product of claim 18, wherein the actions further include: i) estimating a fetal fraction for the test sample from a pregnant female bearing a fetus; j) determining a region-specific fetal fraction for each genomic region according to a correlation between nucleotide sequence read counts per portion and a weighting factor; k) determining a local minimum genomic region size based on the region-specific fetal fraction; and l) adjusting the number of portions for each of the at least two genomic regions to comprise at least two portions based on the local minimum genomic region size, thereby generating a refined re-partitioned genomic region, wherein mapping the nucleotide sequence reads comprises mapping the nucleotide sequence reads from the test sample to the plurality of portions of the at least one refined re-partitioned genomic region from the test sample to the plurality of portions of the at least one re-partitioned genomic region.
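
By way of a non-limiting illustration, actions (a) through (f) recited in the independent claims might be sketched in Python as follows. The coefficient of variation as the measure of sequencing coverage variability, the comparison against the least variable region, the inverse scaling of the number of portions, and all names and toy values are assumptions made for illustration; the claims do not prescribe them. Actions (g) and (h) are not shown.

```python
from statistics import mean, pstdev


def coverage_variability(counts_by_sample):
    """(a) Coefficient of variation of training-set read counts for one region."""
    mu = mean(counts_by_sample)
    if mu == 0:
        raise ValueError("mean read count is zero")
    return pstdev(counts_by_sample) / mu


def partition(start, end, n_portions):
    """Split [start, end) into n_portions contiguous, nearly equal portions."""
    length = end - start
    edges = [start + (length * i) // n_portions for i in range(n_portions + 1)]
    return list(zip(edges[:-1], edges[1:]))


def repartition(regions, training_counts, initial_len):
    """regions: name -> (start, end); training_counts: name -> per-sample counts."""
    cv = {name: coverage_variability(training_counts[name]) for name in regions}
    reference = min(cv.values())  # assumption: least variable region is the baseline
    result = {}
    for name, (start, end) in regions.items():
        # (c) number of portions implied by the initial portion length
        n_initial = max(1, (end - start) // initial_len)
        # (d) compare this region's variability with the baseline
        ratio = cv[name] / reference if reference > 0 else 1.0
        # (e) assumption: more variable regions receive fewer, longer portions
        n_new = max(1, round(n_initial / ratio))
        # (f) re-partition according to the recalculated number of portions
        result[name] = partition(start, end, n_new)
    return result


# Hypothetical toy inputs: two 1 Mb regions and a 50 kb initial portion length (b).
regions = {"region_A": (0, 1_000_000), "region_B": (0, 1_000_000)}
training_counts = {
    "region_A": [90, 110, 90, 110],   # lower variability
    "region_B": [80, 120, 80, 120],   # higher variability
}
new_portions = repartition(regions, training_counts, initial_len=50_000)
```

With these toy inputs, the more variable region is re-partitioned into half as many portions, each twice the initial portion length, while the less variable region keeps its initial partitioning.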